Award  Number:  DAMD17-98-2-8005 


TITLE:  Malaria  Genome  Sequencing  Project 

PRINCIPAL  INVESTIGATOR:  Malcolm  J.  Gardner,  Ph.D. 

CONTRACTING ORGANIZATION: The Institute for Genomic Research

Rockville,  Maryland  20850 

REPORT  DATE:  January  2003 

TYPE  OF  REPORT:  Annual 

PREPARED FOR: U.S. Army Medical Research and Materiel Command
Fort  Detrick,  Maryland  21702-5012 

DISTRIBUTION  STATEMENT:  Approved  for  Public  Release; 

Distribution  Unlimited 


The  views,  opinions  and/or  findings  contained  in  this  report  are 
those of the author(s) and should not be construed as an official
Department  of  the  Army  position,  policy  or  decision  unless  so 
designated  by  other  documentation. 




REPORT  DOCUMENTATION  PAGE 




1. AGENCY USE ONLY (Leave blank)


2. REPORT DATE

January 2003


3. REPORT TYPE AND DATES COVERED

Annual (17 Dec 01 - 16 Dec 02)


4. TITLE AND SUBTITLE

Malaria Genome Sequencing Project


5.  FUNDING  NUMBERS 

DAMD17-98-2-8005 


6. AUTHOR(S)

Malcolm  J.  Gardner,  Ph.D. 


7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

The  Institute  for  Genomic  Research 
Rockville,  Maryland  20850 


8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


Email: gardner@tigr.org


9.  SPONSORING  /  MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

U.S.  Army  Medical  Research  and  Materiel  Command 
Fort Detrick, Maryland 21702-5012


10.  SPONSORING  /  MONITORING 
AGENCY  REPORT  NUMBER 


11. SUPPLEMENTARY NOTES

Original  contains  color  plates:  All  DTIC  reproductions  will  be  in  black  and  white. 


12a.  DISTRIBUTION  /  AVAILABILITY  STATEMENT 

Approved  for  Public  Release;  Distribution  Unlimited 


12b.  DISTRIBUTION  CODE 


13. ABSTRACT (Maximum 200 Words) (abstract should contain no proprietary or confidential information)

The objectives of this 5-year Cooperative Agreement between TIGR and the Malaria Program,
NMRC, were to: Specific Aim 1, sequence 3.5 Mb of P. falciparum genomic DNA;
Specific Aim 2, annotate the sequence; Specific Aim 3, release the information to the
scientific community. Two additional Specific Aims were added to the Cooperative
Agreement: Specific Aim 4, sequencing of P. yoelii to 5X coverage; Specific Aim 5,
sequencing of P. vivax to 5X coverage. This year we reached a major milestone by
publishing, in collaboration with the Sanger Institute and Stanford University, the
complete genome sequence of P. falciparum in the journal Nature. In addition, we published
a comparative analysis of the genome of the rodent malaria parasite P. yoelii with that of
P. falciparum. We began sequencing of the second major human malaria parasite, P. vivax, and
attained 5X coverage of the genome. We obtained additional funds from other sources to
permit the sequencing of P. vivax to 8X coverage, to close one-third of the genome, and to
annotate the genome and compare it to the genomes of P. falciparum and P. yoelii. This
work will be completed under a 12-month no-cost extension of this Cooperative Agreement.
Discussions with the Malaria Program, NMRC, aimed at development of a program to use
genomics and functional genomics to accelerate vaccine research are in progress.


14. SUBJECT TERMS

P. falciparum, P. vivax, P. yoelii, malaria, genome, chromosome


15. NUMBER OF PAGES

51


16. PRICE CODE


17. SECURITY CLASSIFICATION OF REPORT

Unclassified


18. SECURITY CLASSIFICATION OF THIS PAGE

Unclassified


19. SECURITY CLASSIFICATION OF ABSTRACT

Unclassified


20. LIMITATION OF ABSTRACT

Unlimited


NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. Z39-18




Table  of  Contents 

Front cover

SF298

Table of Contents

Introduction

Body

Sequencing of P. falciparum chromosomes 10, 11, and 14 (Specific Aims 1, 2, 3)

Annotation and publication of the P. falciparum genome sequence (Specific Aims 2 and 3)

Sequencing of P. yoelii to 5X coverage (Specific Aim 4)

Sequencing of P. vivax to 5X coverage (Specific Aim 5)

Proteomics studies

Key Research Accomplishments

Reportable Outcomes

Conclusions

References

Appendices


Introduction 


Malaria  is  caused  by  apicomplexan  parasites  of  the  genus  Plasmodium.  It  is  a 
major  public  health  problem  in  many  tropical  areas  of  the  world,  and  also  affects  many 
individuals  and  military  forces  that  visit  these  areas.  In  1994  the  World  Health 
Organization  estimated  that  there  were  300-500  million  cases  and  up  to  2.7  million 
deaths  caused  by  malaria  each  year,  and  because  of  increased  parasite  resistance  to 
chloroquine  and  other  antimalarials  the  situation  is  expected  to  worsen  considerably. 
These  dire  facts  have  stimulated  efforts  to  develop  an  international,  coordinated 
strategy for malaria research and control. Development of new drugs and vaccines
against malaria will undoubtedly be an important factor in control of the disease.
However,  despite  recent  progress,  drug  and  vaccine  development  has  been  a  slow  and 
difficult  process,  hampered  by  the  complex  life  cycle  of  the  parasite,  a  limited  number  of 
drug  and  vaccine  targets,  and  our  incomplete  understanding  of  parasite  biology  and 
host-parasite  interactions. 

The advent of microbial genomics, i.e. the ability to sequence and study the
entire  genomes  of  microbes,  should  accelerate  the  process  of  drug  and  vaccine 
development  for  microbial  pathogens.  As  pointed  out  by  Bloom,  the  complete  genome 
sequence  provides  the  “sequence  of  every  virulence  determinant,  every  protein  antigen, 
and  every  drug  target”  in  an  organism  and  establishes  an  excellent  starting  point  for 
this  process.  In  1995,  an  international  consortium  including  the  National  Institutes  of 
Health,  the  Wellcome  Trust,  the  Burroughs  Wellcome  Fund,  and  the  US  Department  of 
Defense  was  formed  (Malaria  Genome  Sequencing  Project)  to  finance  and  coordinate 
genome  sequencing  of  the  human  malaria  parasite  Plasmodium  falciparum,  and  later,  a 
second,  yet  to  be  determined,  species  of  Plasmodium.  Another  major  goal  of  the 
consortium  was  to  foster  close  collaboration  between  members  of  the  consortium  and 
other  agencies  such  as  the  World  Health  Organization,  so  that  the  knowledge 
generated  by  the  Project  could  be  rapidly  applied  to  basic  research  and  antimalarial 
drug  and  vaccine  development  programs  worldwide.  Participating  centers  include  the 
Naval  Medical  Research  Center,  the  Wellcome  Trust  Sanger  Institute,  and  the  Stanford 
University  Genome  Technology  Center. 


Body 

This  report  describes  progress  in  the  Malaria  Genome  Sequencing  Project 
achieved  by  The  Institute  for  Genomic  Research  and  the  Malaria  Program,  Naval 
Medical Research Center, under Cooperative Research Agreement DAMD17-98-2-8005,
over the 12-month period from Dec. 2001 to Dec. 2002. The Specific Aims of the work
supported  by  this  agreement  are  listed  below.  Specific  Aims  1-3  were  contained  in  the 
original  Cooperative  Agreement.  Specific  Aims  4-5  were  added  to  the  Cooperative 
Agreement  through  modifications. 

The Cooperative Agreement was initially scheduled to expire in December 2002.
However, we were recently granted a 12-month no-cost extension to allow us to
complete a newly expanded Specific Aim 5 (Sequencing of P. vivax to 5X coverage).

1.  Determine  the  sequence  of  3.5  megabases  of  the  P.  falciparum  genome 
(clone  3D7): 

a) Construct small-insert shotgun libraries (1-2 kb inserts) of chromosomal DNA
isolated from preparative pulsed-field gels.

b)  Sequence  a  sufficiently  large  number  of  randomly  selected  clones  from  a 
shotgun  library  to  provide  10-fold  coverage  of  the  selected  chromosome. 

c) Construct P1 artificial chromosome (PAC) libraries (inserts up to 20 kb) of
chromosomal DNA isolated from preparative pulsed-field gels.

d)  If  necessary,  generate  additional  STS  markers  for  the  chromosome  by  i) 
mapping  unique-sequence  contigs  derived  from  assembly  of  the  random  sequences  to 
chromosome,  ii)  mapping  end-sequences  from  chromosome-specific  PAC  clones  to 
YACs. 


e)  Use  TIGR  Assembler  to  assemble  random  sequence  fragments,  and  order 
contigs  by  comparison  to  the  STS  markers  on  each  chromosome. 

f) Close any remaining gaps in the chromosome sequence by PCR and primer
walking using P. falciparum genomic DNA or the YAC, BAC, or PAC clones from each
chromosome as templates.

2.  Analyze  and  annotate  the  genome  sequence: 

a)  employ  a  variety  of  computer  techniques  to  predict  gene  structures  and  relate 
them  to  known  proteins  by  similarity  searches  against  databases;  identify  untranslated 
features  such  as  tRNA  genes,  rRNA  genes,  insertion  sequences  and  repetitive 
elements; determine potential regulatory sequences and ribosome binding sites; use
these  data  to  identify  metabolic  pathways  in  P.  falciparum. 

3.  Establish  a  publicly-accessible  P.  falciparum  genome  database  and 
submit  sequences  to  GenBank. 

4.  Perform  whole  genome  shotgun  sequencing  of  the  rodent  malaria 
parasite  Plasmodium  yoelii  to  3X  coverage,  assemble  into  contigs,  annotate  the 
contigs,  make  the  data  available  on  the  TIGR  web  site,  and  submit  the  data  to 
GenBank. 

5. Perform whole genome shotgun sequencing of the human malaria
parasite Plasmodium vivax to 5X coverage, assemble the contigs, annotate the
contigs, make the data available on the TIGR web site, and submit the data to
GenBank.
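
As a rough illustration of how the coverage targets in the Specific Aims above translate into sequencing effort, the sketch below uses the standard relation coverage = (number of reads x read length) / target size. The function name, the 500-nucleotide read length, and the success-rate parameter are assumptions introduced for illustration; they are not taken from the Cooperative Agreement.

```python
import math


def reads_needed(target_size_bp, fold_coverage, mean_read_length_bp,
                 success_rate=1.0):
    """Estimate how many sequencing reactions must be attempted.

    target_size_bp      -- size of the chromosome or genome in base pairs
    fold_coverage       -- desired average depth (e.g. 10 for 10-fold)
    mean_read_length_bp -- average usable read length (assumed value)
    success_rate        -- fraction of reactions yielding a usable read
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in the interval (0, 1]")
    # Usable reads required to reach the requested average depth.
    usable_reads = fold_coverage * target_size_bp / mean_read_length_bp
    # Inflate by the failure rate to get the number of attempts.
    return math.ceil(usable_reads / success_rate)


# Hypothetical example: 10-fold coverage of 3.5 Mb with 500-nt reads.
print(reads_needed(3_500_000, 10, 500))
```

Under these assumed values the estimate is 70,000 usable reads; halving the success rate doubles the number of reactions required.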


We  are  pleased  to  report  that  excellent  progress  has  been  made  towards 
achievement of these goals. In previous annual reports we announced the publication in
Science of the first complete sequence of a malarial chromosome (chromosome 2);
development of a Plasmodium gene finding program, GlimmerM; introduction of
optical restriction mapping technology for rapid mapping of whole Plasmodium
chromosomes; completion of the random phase of sequencing of 3 additional P.
falciparum chromosomes and major progress in gap closure; and sequencing of the rodent
malaria parasite Plasmodium yoelii to 5X coverage and release of preliminary
annotation of this genome on the TIGR web site
(http://www.tigr.org/tdb/edb2/pya1/htmls/). Through a subcontract to Dr. John Yates at
the Scripps Research Institute, we also assisted NMRC in their pilot project to apply the
techniques  of  proteomics  towards  the  identification  of  novel  antigens  in  parasite 
(sporozoite)  extracts.  Finally,  we  have  continually  reviewed  with  NMRC  further  steps 
that  can  be  taken  to  more  rapidly  apply  Plasmodium  genomics,  functional  genomics, 
and  proteomics  to  problems  of  vaccine  development  for  malaria. 

Over the past year, we have completed Specific Aims 1, 2, 3, and 4.
Chromosomes 10, 11, and 14 have been completed, and in collaboration with the
Sanger Institute, Stanford University, and NMRC, the complete P. falciparum genome
sequence was published in Nature on Oct. 3, 2002. We also completed the
sequencing of P. yoelii to 5X coverage, and published this sequence and a comparison
to P. falciparum. Finally, we have continued to sequence the genome of P. vivax and
have attained 6X coverage. These data were released to the public on the TIGR website.

Sequencing of P. falciparum chromosomes 10, 11, and 14 (Specific Aims 1, 2, 3)

Sequencing of chromosomes 10, 11, and 14 was funded primarily by grants from the
NIAID  (chromosomes  10  and  11)  and  the  Burroughs  Wellcome  Fund  (chromosome  14). 
Funds  from  this  collaborative  agreement  were  used  to  accelerate  the  sequencing,  assist 
in  closure  and  annotation,  and  facilitate  rapid  utilization  of  the  sequence  data  by  the 
DoD  vaccine  and  drug  development  groups.  In  previous  years  we  described  the 
isolation  of  chromosomal  DNA,  preparation  of  shotgun  libraries,  random  sequencing, 
assembly,  gap  closure,  production  and  public  release  of  preliminary  annotation.  This 
past  year  focused  primarily  on  gap  closure  and  the  final  annotation  and  publication  of 
the  P.  falciparum  genome  sequence  in  collaboration  with  the  other  members  of  the  P. 
falciparum  genome  consortium. 

All gaps in chromosomes 10, 11, and 14 have now been closed. In the next few
months  we  will  be  submitting  the  revised  sequences  and  updated  annotation  to 
GenBank  and  PlasmoDB. 

Annotation and publication of the P. falciparum genome sequence (Specific Aims 2 and 3)

In  last  year’s  report  we  described  an  agreement  made  with  the  Sanger  Institute 
and Stanford University to collaborate on the joint analysis and publication of the entire P.
falciparum genome sequence. This whole genome overview was to be accompanied by
a series of papers by each sequencing center on the chromosomes sequenced by each
group. The whole genome overview and chromosome papers were to be published in a
single issue of a journal. In addition, a comparative analysis of the P. falciparum and P.
yoelii genomes based upon the 5X coverage P. yoelii sequence was to be published
along with the P. falciparum papers.

The  principal  investigator  of  this  agreement  was  selected  to  be  the  coordinator  of 
the  annotation  effort  and  the  lead  author  on  the  final  publication.  Furthermore,  TIGR 
was chosen to be the central repository of all the P. falciparum genome data. Over the
past  year,  TIGR  collected  the  chromosome  sequences  and  associated  annotation  from 
the  other  sequencing  centers  and  coordinated  the  analysis  of  the  genome  sequence 
and the preparation of the whole genome manuscript and a series of chromosome manuscripts for
publication. The manuscripts were submitted for publication in July 2002 and published
in Nature on Oct. 3, 2002.


Sequencing  of  P.  yoelii  to  5X  coverage  (Specific  Aim  4) 

A  secondary  goal  established  at  the  initiation  of  the  malaria  genome  project  was 
to sequence the genome of another species of Plasmodium so as to be able to
perform  a  series  of  comparative  analyses. 

After  discussions  with  NMRC  we  elected  to  proceed  with  sequencing  of  P.  yoelii. 
Reductions  in  the  costs  of  sequencing  allowed  us  to  perform  this  work  without 
requesting  additional  funds.  The  genome  was  sequenced  to  5X  coverage  and  a 
comparative  analysis  with  the  P.  falciparum  genome  was  performed.  This  work  was 
published in Nature on Oct. 3, 2002.


Sequencing  of  P.  vivax  to  5X  coverage  (Specific  Aim  5) 

P.  vivax  is  the  second  most  important  human  malaria  parasite.  In  last  year’s 
report,  we  described  the  addition  of  this  Specific  Aim  to  the  Cooperative  Agreement. 
Two genomic shotgun libraries of P. vivax (Sal I strain) were constructed using DNA
provided  by  John  Barnwell  of  the  Centers  for  Disease  Control.  One  library  contained  2-3 
kb  inserts  and  the  other  library  contained  5-6  kb  inserts.  As  of  Dec.  15,  2002,  220,7763 
sequences  with  an  average  read  length  of  650  nucleotides  have  been  generated  at  a 
success  rate  of  85.6%.  A  preliminary  assembly  has  been  performed  which  suggests 
that the P. vivax genome is about 23 Mb in size, about the same as that of P.
falciparum. Furthermore, the use of libraries with inserts of 5-6 kb resulted in the
generation of large scaffolds of contigs linked by forward-reverse read pairs, which will
facilitate gap closure. Preliminary contigs are available for searching on the TIGR web
site (http://www.tigr.org/tdb/e2k1/pva1/).
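
The relationship between read count, read length, genome size and fold coverage described above can be sketched as follows. This is a minimal illustration, not the procedure used at TIGR: the function names are hypothetical, the input values are round numbers chosen for the example rather than the figures reported here, and the estimate of unsequenced bases uses the idealized Lander-Waterman (Poisson) model, which assumes reads fall uniformly at random and is likely to be optimistic for a genome with skewed base composition.

```python
import math


def fold_coverage(n_reads, mean_read_length_bp, genome_size_bp):
    """Average sequencing depth: total bases read divided by genome size."""
    return n_reads * mean_read_length_bp / genome_size_bp


def expected_fraction_unsequenced(coverage):
    """Poisson estimate of the fraction of bases covered by no read.

    Assumes reads are placed uniformly at random (an idealization).
    """
    return math.exp(-coverage)


# Illustrative values only: 230,000 reads of 500 nt on a 23 Mb genome.
depth = fold_coverage(230_000, 500, 23_000_000)
missing = expected_fraction_unsequenced(depth)
print(f"depth = {depth:.1f}X, expected unsequenced fraction = {missing:.4f}")
```

With these assumed inputs the depth is 5X and the model predicts that somewhat under 1% of bases remain unsequenced, which is one reason targeted gap closure is still needed after the random phase.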

Our  original  intention  was  to  sequence  to  5X  coverage  and  perform  a 
comparative  analysis  with  P.  falciparum.  However,  due  to  reductions  in  sequencing 
costs  and  additional  funds  obtained  from  a  no-cost  extension  to  an  NIAID-funded 
project,  we  should  be  able  to  obtain  8X  coverage  and  close  up  to  one-third  of  the  P. 
vivax  genome.  This  will  facilitate  a  more  detailed  analysis  of  the  genome  sequence  and 
assist  in  the  identification  of  novel  vaccine  and  drug  targets  for  this  relatively  little- 
studied  malaria  parasite. 

To  summarize  the  sequencing  efforts  in  Specific  Aims  1-5,  by  the  end  of  this 
cooperative  agreement,  the  complete  genome  sequence  of  P.  falciparum  will  have  been 
published, as will a comparative analysis of the P. falciparum and P. yoelii genomes.
The P. vivax genome will have been sequenced to 5X coverage. A 3-way comparative
analysis of the P. falciparum, P. vivax, and P. yoelii genomes will also be performed,
but  this  is  unlikely  to  be  completed  until  after  this  cooperative  agreement  has  expired. 


Proteomics  studies 

A  major  goal  of  the  malaria  genome  project  is  to  identify  antigens  for  vaccine 
development.  Analysis  of  the  genome  sequence  data  can  be  used  to  identify  potential 
antigens  but  does  not  by  itself  provide  all  of  the  information  required  for  selection  and 
prioritization  of  vaccine  candidates.  For  example,  the  genome  sequence  itself  does  not 
specify  at  which  point  in  the  life  cycle  a  gene  is  transcribed,  or  whether  the  protein 
product  of  a  gene  is  actually  present  in  the  parasite.  To  identify  proteins  present  in 
various  stages  of  the  parasite  life  cycle,  we  have  begun  to  use  proteomics  techniques  to 
directly  identify  parasite  proteins  in  cell  lysates. 

In  the  last  two  annual  reports  we  reported  on  work  done  by  Dr.  John  Yates  and 
colleagues  at  the  Scripps  Research  Institute,  partly  funded  by  a  subcontract  from  TIGR 
under  this  cooperative  agreement.  Briefly,  proteins  in  parasite  lysates  were  digested 
with  proteases  and  the  resulting  peptides  were  separated  by  high-resolution  liquid 
chromatography.  The  peptides  were  then  injected  into  a  tandem  mass  spectrometer. 
Spectra  of  each  peptide  were  matched  against  predicted  spectra  of  the  peptides 
predicted  from  the  genome  sequence.  In  this  way  peptides  generated  from  cell  lysates 
were  used  to  identify  the  proteins  present  in  the  cell  lysate.  Our  role  has  been  mainly  to 
provide  Dr.  Yates’s  group  with  genomic  sequence  data  from  P.  falciparum  and  P.  yoelii, 
which  they  used  to  identify  peptides  derived  from  parasite  lysates.  Over  2,400  P. 
falciparum  proteins,  about  45%  of  the  total  proteins  predicted  from  the  genome 
sequence, were identified, including approximately 500 proteins from sporozoite stages. The
NMRC  is  using  this  data  to  select  antigens  for  vaccine  development. 
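
The matching step described above can be illustrated with a simplified sketch. The actual analysis compared measured tandem mass spectra with spectra predicted from the genome sequence; the code below substitutes a much simpler technique, exact matching of peptide sequences against an in silico digest of predicted proteins. The trypsin-like cleavage rule (after K or R, but not before P), the function names, and the toy protein sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def tryptic_peptides(protein_seq, min_len=6):
    """Digest a protein in silico with a trypsin-like rule (assumed)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein_seq):
        next_residue = protein_seq[i + 1] if i + 1 < len(protein_seq) else ""
        # Cleave after K or R unless the next residue is proline.
        if residue in "KR" and next_residue != "P":
            peptides.append(protein_seq[start:i + 1])
            start = i + 1
    if start < len(protein_seq):
        peptides.append(protein_seq[start:])  # C-terminal remainder
    return [p for p in peptides if len(p) >= min_len]


def build_peptide_index(predicted_proteins):
    """Map each predicted peptide to the set of proteins containing it."""
    index = defaultdict(set)
    for name, seq in predicted_proteins.items():
        for peptide in tryptic_peptides(seq):
            index[peptide].add(name)
    return index


def identify_proteins(observed_peptides, index):
    """Return {protein: set of supporting peptides} for observed peptides."""
    hits = defaultdict(set)
    for peptide in observed_peptides:
        for protein in index.get(peptide, ()):
            hits[protein].add(peptide)
    return dict(hits)


# Toy sequences, not real Plasmodium proteins.
proteins = {"protA": "MKTAYIAKQRPLLSEGK", "protB": "GGSSTTVVAAKLLMMNNPPQQR"}
index = build_peptide_index(proteins)
print(identify_proteins(["TAYIAK", "LLMMNNPPQQR", "XXXXXX"], index))
```

Peptides that match no predicted protein (the third one here) are simply left unassigned, which is why the completeness of the gene predictions limits how many proteins can be identified.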

Key Research Accomplishments

1) The sequences of chromosomes 2, 10, 11, and 14 were completed.

2) Chromosomes 2, 10, 11, and 14 were annotated at TIGR.

3)  In  collaboration  with  the  Sanger  Institute  and  Stanford  University,  the  entire  P. 
falciparum  genome  was  annotated. 

4) The P. yoelii genome sequence obtained at 5X coverage was annotated.

5)  Sequencing  of  the  P.  vivax  genome  reached  6X  coverage.  Preliminary 
assemblies  were  performed  and  indicate  that  the  final  sequence  will  be  of  high 
quality. 


Reportable  Outcomes 

1) Web site. Preliminary contigs and annotation for the P. vivax genome at 5X
coverage (http://www.tigr.org/tdb/e2k1/pva1/).


2)  Web  site.  Final  annotation  of  the  P.  falciparum  genome 
(http://www.tigr.org/tdb/e2k1/pfa1/).

3) Publication. Gardner, M. J., Hall, N., Fung, E., White, O., Berriman, M., Hyman,
R. W., Carlton, J. M., Pain, A., Nelson, K. E., Bowman, S., Paulsen, I. T., James,
K., Eisen, J. A., Rutherford, K., Salzberg, S. L., Craig, A., Kyes, S., Chan, M.-S., [...]

4) Publication. [...] J. C., Carucci, D. J., Hoffman, S. L. & Fraser, C. M. Sequence of Plasmodium
falciparum chromosomes 2, 10, 11, and 14. Nature 419, 531-534 (2002).


5)  Publication.  Carlton,  J.  M.,  Angiuoli,  S.  V.,  Suh,  B.  B.,  Kooij,  T.,  Pertea,  M., 
Ermolaeva,  M.  D.,  Allen,  D.  R.,  Silva,  J.  C.,  Selengut,  J.  D.,  Koo,  H.  L.,  Peterson, 
J. D., Pop, M., Kosack, D. S., Shumway, M. F., Bidwell, S. L., Shallom, S. J.,
Shoaibi, A., Cummings, L. M., Cho, J. K., Quackenbush, J., van Aken, S. E.,
Riedmuller, S. B., Feldblyum, T. V., Florens, L., Yates III, J. R., Raine, D. J.,
Sinden, R. E., Harris, M. A., Cunningham, D. A., Preiser, P. R., Bergman, L. W.,
Vaidya,  A.  B.,  van  Lin,  L.  H.,  Janse,  C.  J.,  Waters,  A.  P.,  Smith,  H.  O.,  White,  O. 
R.,  Salzberg,  S.  L.,  Venter,  J.  C.,  Fraser,  C.  M.,  Hoffman,  S.  L.,  Gardner,  M.  J.  & 
Carucci,  D.  J.  Genome  sequence  and  comparative  analysis  of  the  model  rodent 
malaria  parasite  Plasmodium  yoelii.  Nature  419,  512-519  (2002). 

6) M. J. Gardner, Complete genome sequence of the human malaria parasite
Plasmodium  falciparum.  Press  conference  to  announce  publication  of  the  P. 
falciparum  genome  sequence  in  Nature.  Held  at  the  American  Association  for  the 
Advancement of Science, Washington, D.C., Oct. 2002.

7)  M.  J.  Gardner,  Complete  genome  sequence  of  Plasmodium  falciparum.  Paper 
presented  at  the  American  Society  of  Tropical  Medicine  and  Hygiene  Annual 
Meeting,  Denver,  Nov.  2002. 

8)  M.  J.  Gardner,  Complete  genome  sequence  of  Plasmodium  falciparum.  Paper 
presented at the 3rd Multilateral Initiative on Malaria Meeting, Arusha, Tanzania,
Nov.  2002. 


Conclusions 

The  objectives  of  this  5-year  Cooperative  Agreement  between  TIGR  and 
the  Malaria  Program,  NMRC,  were  to:  Specific  Aim  1,  sequence  3.5  Mb  of  P. 
falciparum  genomic  DNA;  Specific  Aim  2,  annotate  the  sequence;  Specific  Aim  3, 
release  the  information  to  the  scientific  community.  Two  additional  Specific  Aims  were 
added to the Cooperative Agreement: Specific Aim 4, sequencing of P. yoelii to 3X
coverage; Specific Aim 5, sequencing of P. vivax to 5X coverage.

By  publishing  the  complete  genome  sequence  of  P.  falciparum,  and 
chromosomes  2,  10,  11  and  14,  we  have  completed  Specific  Aims  1-3.  By 
sequencing  P.  yoelii  to  5X  coverage  and  publishing  an  analysis  of  this  genome  we  have 
completed  Specific  Aim  4.  We  received  a  12-month  no-cost  extension  to  assist  in 
completion  of  Specific  Aim  5,  sequencing  of  P.  vivax  to  5X  coverage.  This  has  been 
completed,  but  by  using  funds  from  other  sources  we  will  be  able  to  enhance  Specific 
Aim 5 by closing up to one-third of the P. vivax genome sequence, annotating the
genome sequence, and performing a comparative analysis of the P. vivax and P.
falciparum genomes.

Genome sequence of the human malaria
parasite Plasmodium falciparum

Malcolm J. Gardner, Neil Hall, Eula Fung, Owen White, Matthew Berriman, Richard W. Hyman, Jane M. Carlton, Arnab Pain,
Karen E. Nelson, Sharen Bowman, Ian T. Paulsen, Keith James, Jonathan A. Eisen, Kim Rutherford, Steven L. Salzberg, Alister Craig,
Sue Kyes, Man-Suen Chan, Vishvanath Nene, Shamira J. Shallom, Bernard Suh, Jeremy Peterson, Sam Angiuoli, Mihaela Pertea,
Jonathan Allen, Jeremy Selengut, Daniel Haft, Michael W. Mather, Akhil B. Vaidya, David M. A. Martin, Alan H. Fairlamb,
Martin J. Fraunholz, David S. Roos, Stuart A. Ralph, Geoffrey I. McFadden, Leda M. Cummings, G. Mani Subramanian, Chris Mungall,
J. Craig Venter, Daniel J. Carucci, Stephen L. Hoffman, Chris Newbold, Ronald W. Davis, Claire M. Fraser & Bart Barrell


The parasite Plasmodium falciparum is responsible for hundreds of millions of cases of malaria, and kills more than one million
African children annually. Here we report an analysis of the genome sequence of P. falciparum clone 3D7. The 23-megabase
nuclear genome consists of 14 chromosomes, encodes about 5,300 genes, and is the most (A + T)-rich genome sequenced to date.
Genes  involved  in  antigenic  variation  are  concentrated  in  the  subtelomeric  regions  of  the  chromosomes.  Compared  to  the  genomes 
of  free-living  eukaryotic  microbes,  the  genome  of  this  intracellular  parasite  encodes  fewer  enzymes  and  transporters,  but  a  large 
proportion  of  genes  are  devoted  to  immune  evasion  and  host-parasite  interactions.  Many  nuclear-encoded  proteins  are  targeted  to 
the  apicoplast,  an  organelle  involved  in  fatty-acid  and  isoprenoid  metabolism.  The  genome  sequence  provides  the  foundation  for 
future  studies  of  this  organism,  and  is  being  exploited  in  the  search  for  new  drugs  and  vaccines  to  fight  malaria. 


Despite more than a century of efforts to eradicate or control
malaria, the disease remains a major and growing threat to the
public health and economic development of countries in the
tropical and subtropical regions of the world. Approximately 40%
of the world's population lives in areas where malaria is transmitted.
There are an estimated 300-500 million cases and up to 2.7 million
deaths from malaria each year. The mortality levels are greatest in
sub-Saharan Africa, where children under 5 years of age account for
90% of all deaths due to malaria. Human malaria is caused by
infection with intracellular parasites of the genus Plasmodium that
are transmitted by Anopheles mosquitoes. Of the four species of
Plasmodium that infect humans, Plasmodium falciparum is the most
lethal. Resistance to anti-malarial drugs and insecticides, the decay
of public health infrastructure, population movements, political
unrest, and environmental changes are contributing to the spread of
malaria. In countries with endemic malaria, the annual economic
growth rates over a 25-year period were 1.5% lower than in other
countries. This implies that the cumulative effect of the lower
annual economic output in a malaria-endemic country was a 50%
reduction in the per capita GDP compared to a non-malarious
country. Recent studies suggest that the number of malaria cases
may double in 20 years if new methods of control are not devised
and implemented.

An international effort was launched in 1996 to sequence the
P. falciparum genome with the expectation that the genome
sequence would open new avenues for research. The sequences of
two of the 14 chromosomes, representing 8% of the nuclear
genome, were published previously and the accompanying Letters
in this issue describe the sequences of chromosomes 1, 3-9 and 13
(ref. 7), 2, 10, 11 and 14 (ref. 8), and 12 (ref. 9). Here we report an
analysis of the genome sequence of P. falciparum clone 3D7,
including descriptions of chromosome structure, gene content,
functional classification of proteins, metabolism and transport,
and other features of parasite biology.

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA; The
Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA,
UK; Stanford Genome Technology Center, 855 California Avenue, Palo Alto, California 94304, USA;
Liverpool School of Tropical Medicine, Pembroke Place, Liverpool L3 5QA, UK; University of Oxford,
Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, Headington, Oxford OX3 9DU, UK;
Department of Microbiology and Immunology, Drexel University College of Medicine, 2900 Queen
Lane, Philadelphia, Pennsylvania 19129, USA; School of Life Sciences, The Wellcome Trust Biocentre,
The University of Dundee, Dundee DD1 5EH, UK; Department of Biology and Genomics Institute,
University of Pennsylvania, Philadelphia, Pennsylvania 19104-6018, USA; Plant Cell Biology Research

Sequencing  strategy 

A whole chromosome shotgun sequencing strategy was used to
determine the genome sequence of P. falciparum clone 3D7. This
approach was taken because a whole genome shotgun strategy was
not feasible or cost-effective with the technology that was available
at the beginning of the project. Also, high-quality large insert
libraries of (A + T)-rich P. falciparum DNA have never been
constructed in Escherichia coli, which ruled out a clone-by-clone
sequencing strategy. The chromosomes were separated on pulsed
field gels, and chromosomal DNA was extracted and used to
construct shotgun libraries of 1-3-kilobase (kb) fragments of
sheared DNA. Eleven of the fourteen chromosomes could be
resolved on the gels, but chromosomes 6, 7 and 8 could not be
resolved and were sequenced as a group. The shotgun sequences
were assembled into contiguous DNA sequences (contigs), in some
cases with low coverage shotgun sequences of yeast artificial
chromosome (YAC) clones to assist in the ordering of contigs for
closure. Sequence tagged sites (STSs), microsatellite markers
and HAPPY mapping were also used to place and orient contigs
during the gap closure process. The high (A + T) content of the
genome made gap closure extremely difficult. The predicted
restriction enzyme maps of the chromosome sequences were
compared to optical restriction maps to verify that the chromosomes
had been assembled correctly. Chromosomes 1-5, 9 and 12
were closed, whereas chromosomes 6-8, 10, 11, 13 and 14 contained
3-37 gaps (most <2.5 kb) per chromosome at the beginning of
genome annotation. Efforts to close the remaining gaps are
continuing.
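
The comparison of predicted restriction maps with optical maps can be illustrated with an in silico digest. This is a minimal sketch under stated assumptions: the helper names `digest` and `maps_agree`, the toy sequence, the choice of the BamHI site (GGATCC, cut after the first G) and the 10% size tolerance are all illustrative and are not taken from the project.

```python
def digest(seq: str, site: str, cut_offset: int = 0) -> list[int]:
    """Return fragment lengths from cutting a linear sequence at every
    occurrence of `site`; the cut falls `cut_offset` bases into the site."""
    seq, site = seq.upper(), site.upper()
    cuts = []
    start = 0
    while True:
        i = seq.find(site, start)
        if i == -1:
            break
        cuts.append(i + cut_offset)
        start = i + 1
    bounds = [0] + cuts + [len(seq)]
    # Drop zero-length fragments that would arise from a cut at either end.
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def maps_agree(predicted: list[int], observed: list[int], tolerance: float = 0.1) -> bool:
    """True if both maps have the same number of fragments and each predicted
    size is within `tolerance` (as a fraction) of the observed size."""
    if len(predicted) != len(observed):
        return False
    return all(abs(p - o) <= tolerance * o for p, o in zip(predicted, observed))


# Toy sequence (hypothetical), two BamHI sites.
toy = "AAAA" + "GGATCC" + "TTTTTT" + "GGATCC" + "AA"
fragments = digest(toy, "GGATCC", cut_offset=1)
print(fragments)
print(maps_agree(fragments, [5, 13, 7]))
```

A real comparison would also have to cope with missing or extra cut sites and with the sizing error of the optical map, which this sketch ignores.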

Genome structure and content

The P. falciparum 3D7 nuclear genome is composed of 22.8 megabases
(Mb) distributed among 14 chromosomes ranging in size
from approximately 0.643 to 3.29 Mb (Fig. 1, and Supplementary
Figs A-N). Thus the P. falciparum genome is almost twice the size of
the genome of the fission yeast Schizosaccharomyces pombe. The
overall (A + T) composition is 80.6%, and rises to ~90% in introns
and intergenic regions. The structures of protein-encoding genes
were predicted using several gene-finding programs and manually
curated. Approximately 5,300 protein-encoding genes were identified,
about the same as in S. pombe (Table 1, and Supplementary
Table A). This suggests an average gene density in P. falciparum of 1
gene per 4,338 base pairs (bp), slightly higher than was found
previously with chromosomes 2 and 3 (1 per 4,500 bp and 1 per
4,800 bp, respectively). The higher gene density reported here is
probably the result of improved gene-finding software and larger
training sets that enabled the detection of genes overlooked previously.
Introns were predicted in 54% of P. falciparum genes, a
proportion roughly similar to that in S. pombe and Dictyostelium
discoideum, but much higher than observed in Saccharomyces
cerevisiae where only 5% of genes contain introns. Excluding
introns, the mean length of P. falciparum genes was 2.3 kb, substantially
larger than in the other organisms in which the average
gene lengths range from 1.3 to 1.6 kb. A markedly greater proportion
of Plasmodium falciparum genes (15.5%) were longer than
4 kb than in S. pombe and S. cerevisiae (3.0% and 3.6%,
respectively). The explanation for the increased gene length in
P. falciparum is not clear. Many of these large genes encode
uncharacterized proteins that may be cytosolic proteins, as they
do not possess recognizable signal peptides. No transposable
elements or retrotransposons were identified.
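
The summary figures quoted above follow directly from the totals in Table 1. The short calculation below is a sketch that recomputes them; the variable names are illustrative, and the input numbers are the P. falciparum values given in Table 1.

```python
# P. falciparum values as given in Table 1.
GENOME_SIZE_BP = 22_853_764
N_GENES = 5_268
N_EXONS = 12_674
EXON_TOTAL_BP = 12_028_350
GC_PERCENT = 19.4

gene_density = GENOME_SIZE_BP / N_GENES              # bp per gene
percent_coding = 100 * EXON_TOTAL_BP / GENOME_SIZE_BP
mean_gene_length = EXON_TOTAL_BP / N_GENES           # excluding introns
mean_exon_length = EXON_TOTAL_BP / N_EXONS
at_percent = 100 - GC_PERCENT

print(f"gene density      : 1 gene per {gene_density:,.0f} bp")
print(f"per cent coding   : {percent_coding:.1f}")
print(f"mean gene length  : {mean_gene_length:,.0f} bp")
print(f"mean exon length  : {mean_exon_length:,.0f} bp")
print(f"(A + T) content   : {at_percent:.1f}%")
```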

Fifty-two per cent of the predicted gene products (2,731) were
detected in cell lysates prepared from several stages of the parasite
life cycle by high-resolution liquid chromatography and tandem
mass spectrometry, including many predicted proteins with no
similarity to proteins in other organisms. In addition, 49% of the
genes overlapped (97% identity over at least 100 nucleotides) with
expressed sequence tags (ESTs) derived from several life-cycle
stages. As the proteomics and EST studies performed to date may
not represent a complete sampling of all genes expressed during the
complex life cycle of the parasite, this suggests that the annotation
process identified substantial portions of most genes. However, in
the absence of supporting EST or protein evidence, correct prediction
of the 5' ends of genes and genes with multiple small exons is
challenging, and the gene models should be regarded as preliminary.
Additional ESTs and full-length complementary DNA sequences
are required for the development of better training sets for
gene-finding programs and the verification of the predicted genes.
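
The EST overlap criterion quoted above (97% identity over at least 100 nucleotides) can be expressed as a simple filter. The sketch below is illustrative only: the `EstHit` record, its field names and the example gene identifiers are assumptions, and treating both thresholds as inclusive is also an assumption.

```python
from dataclasses import dataclass

MIN_IDENTITY_PCT = 97.0  # threshold quoted in the text
MIN_ALIGNED_NT = 100     # threshold quoted in the text


@dataclass(frozen=True)
class EstHit:
    gene_id: str         # hypothetical gene identifier
    identity_pct: float  # per cent identity of the gene/EST alignment
    aligned_nt: int      # alignment length in nucleotides


def supported_genes(hits):
    """Return the set of gene IDs with at least one hit meeting both thresholds."""
    return {
        h.gene_id
        for h in hits
        if h.identity_pct >= MIN_IDENTITY_PCT and h.aligned_nt >= MIN_ALIGNED_NT
    }


example_hits = [
    EstHit("geneA", 99.1, 250),  # passes both thresholds
    EstHit("geneB", 96.0, 400),  # identity too low
    EstHit("geneC", 98.0, 80),   # alignment too short
    EstHit("geneC", 97.0, 100),  # exactly at both thresholds
]
print(sorted(supported_genes(example_hits)))
```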

The nuclear genome contains a full set of transfer RNA (tRNA)
ligase genes, and 43 tRNAs were identified to bind all codons except
TGT and TGC, coding for Cys; it is possible that these tRNAs are
located within the currently unsequenced regions. All codons ending
in C and T appear to be read by single tRNAs with a G in the first
position, which is likely to read both codons via G:U wobble. Each
anticodon occurs only once except for methionine (CAT), for which
there are two copies, one for translation initiation and one for
internal methionines, and the glycine (CCT) anticodon, which
occurs twice. An unusual tRNA resembling a selenocysteinyl-tRNA
was also found. A putative selenocysteine lyase was identified,
which may provide selenium for synthesis of selenoproteins.
Increased growth has been observed in selenium-supplemented
Plasmodium culture.
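
The wobble rule described above, in which an anticodon with G in its first position reads codons ending in either C or T, can be sketched as follows. This is a deliberately simplified model: it uses the DNA alphabet as the text does, implements only the G:U wobble pair named in the text, and ignores modified bases and other wobble pairs. The function and table names are illustrative.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# First anticodon base -> codon third-position bases it is assumed to read.
# Only the G:U (here G:T) wobble pair from the text is modelled.
THIRD_BASE_READ = {"G": "CT", "C": "G", "A": "T", "T": "A"}


def codons_read(anticodon: str) -> list[str]:
    """Codons (5'->3') read by an anticodon written 5'->3'.

    Anticodon position 3 pairs with codon position 1, position 2 with
    position 2, and position 1 (the wobble position) with codon position 3.
    """
    first, middle, last = anticodon.upper()
    prefix = COMPLEMENT[last] + COMPLEMENT[middle]
    return sorted(prefix + third for third in THIRD_BASE_READ[first])


print(codons_read("GAA"))  # a G-first anticodon reads an NNC/NNT codon pair
print(codons_read("CAT"))  # the methionine anticodon mentioned in the text
print(codons_read("GCA"))  # the anticodon that would read the Cys codons TGC and TGT
```

Under this model a single tRNA with anticodon GCA would cover both cysteine codons, which is consistent with the suggestion that the missing tRNA lies in an unsequenced region.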

In almost all other eukaryotic organisms sequenced to date, the
tRNA genes exhibit extensive redundancy, the only exception being
the intracellular parasite Encephalitozoon cuniculi which contains
44 tRNAs. Often, the abundance of specific anticodons is correlated
with the codon usage of the organism. This is not the case
in P. falciparum, which exhibits minimal redundancy of tRNAs. The
mitochondrial genome of Plasmodium is small (about 6 kb) and
encodes no tRNAs, so the mitochondrion must import tRNAs.
Through their import, cytoplasmic tRNAs may serve mitochondrial
protein synthesis in a manner seen with other organisms. The
apicoplast genome appears to encode sufficient tRNAs for protein
synthesis within the organelle.

Unlike  many  other  eukaryotes,  the  malaria  parasite  genome  does 
not  contain  long  tandemly  repeated  arrays  of  ribosomal  RNA 
(rRNA)  genes.  Instead,  Plasmodium  parasites  contain  several  single 
18S-5.8S-28S  rRNA  units  distributed  on  different  chromosomes. 


Table 1 Plasmodium falciparum nuclear genome summary and comparison to other organisms

Feature                                P. falciparum  S. pombe     S. cerevisiae  D. discoideum  A. thaliana
Size (bp)                              22,853,764     12,462,637   12,495,682     8,100,000      115,409,949
(G + C) content (%)                    19.4           36.0         38.3           22.2           34.9
No. of genes                           5,268*         4,929        5,770          2,799          25,498
Mean gene length† (bp)                 2,283          1,426        1,424          1,626          1,310
Gene density (bp per gene)             4,338          2,528        2,088          2,600          4,526
Per cent coding                        52.6           57.5         70.5           56.3           28.8
Genes with introns (%)                 53.9           43           5.0            68             79
Exons
  Number                               12,674         ND           ND             6,398          132,982
  No. per gene                         2.39           ND           NA             2.29           5.18
  (G + C) content (%)                  23.7           39.6         28.0           28.0           ND
  Mean length (bp)                     949            ND           ND             711            170
  Total length (bp)                    12,028,350     ND           ND             4,548,978      33,249,250
Introns
  Number                               7,406          4,730        272            3,587          107,784
  (G + C) content (%)                  13.5           ND           NA             13.0           ND
  Mean length (bp)                     178.7          81           NA             177            170
  Total length (bp)                    1,323,509      383,130      ND             643,899        18,055,421
Intergenic regions
  (G + C) content (%)                  13.6           ND           ND             14.0           ND
  Mean length (bp)                     1,694          952          515            786            ND
RNAs
  No. of tRNA genes                    43             174          ND             73             ND
  No. of 5S rRNA genes                 3              30           ND             NA             ND
  No. of 5.8S, 18S and 28S rRNA units  7              200-400      ND             NA             700-800

ND, not determined; NA, not applicable. ‘No. of genes’ for D. discoideum are for chromosome 2 (ref. 155) and in some cases represent extrapolations to the entire genome. Sources of data for the other
organisms: S. pombe, S. cerevisiae, D. discoideum and A. thaliana.

* 70% of these genes matched expressed sequence tags or encoded proteins detected by proteomics analyses.
† Excluding introns.

The sequence encoded by a rRNA gene in one unit differs from the
sequence of the corresponding rRNA in the other units. Furthermore,
the expression of each rRNA unit is developmentally regulated,
resulting in the expression of a different set of rRNAs at
different stages of the parasite life cycle. It is likely that by
changing the properties of its ribosomes the parasite is able to alter
the rate of translation, either globally or of specific messenger RNAs
(mRNAs), thereby changing the rate of cell growth or altering
patterns of cell development. The two types of rRNA genes
previously described in P. falciparum are the S-type, expressed
primarily in the mosquito vector, and the A-type, expressed primarily
in the human host. Seven loci encoding rRNAs were
identified in the genome sequence (Fig. 1). Two copies of the
S-type rRNA genes are located on chromosomes 11 and 13, and
two copies of the A-type genes are located on chromosomes 5 and 7.
In addition, chromosome 1 contains a third, previously uncharacterized,
rRNA unit that encodes 18S and 5.8S rRNAs that are
almost identical to the S-type genes on chromosomes 11 and 13,
but has a significantly divergent 28S rRNA gene (65% identity to the
A-type and 75% identity to the S-type). The expression profiles of
these genes are unknown. Chromosome 8 also contains two unusual
rRNA gene units that contain 5.8S and 28S rRNA genes but do not
encode 18S rRNAs; it is not known whether these genes are
functional. The sequences of the 18S and 28S rRNA genes on
chromosome 7 and the 28S rRNA gene on chromosome 8 are
incomplete as they reside at contig ends. The 5S rRNA is encoded by
three identical tandemly arrayed genes on chromosome 14.

Chromosome  structure 

Plasmodium falciparum chromosomes vary considerably in length,
with most of the variation occurring in the subtelomeric regions.
Field isolates, even those from individuals residing in a single
village, exhibit extensive size polymorphism that is thought to
be due to recombination events between different parasite clones
during meiosis in the mosquito. Chromosome size variation is
also observed in cultures of erythrocytic parasites, but is due to
chromosome breakage and healing events and not to meiotic
recombination. Subtelomeric deletions often extend well into
the chromosome, and in some cases alter the cell adhesion properties
of the parasite owing to the loss of the gene(s) encoding
adhesion molecules. Because many genes involved in antigenic
variation are located in the subtelomeric regions, an understanding
of subtelomere structure and functional properties is essential for
the elucidation of the mechanisms underlying the generation of
antigenic diversity.

The subtelomeric regions of the chromosomes display a striking
degree of conservation within the genome that is probably due to
promiscuous inter-chromosomal exchange of subtelomeric regions.
Subtelomeric exchanges occur in other eukaryotes, but the
regions involved are much smaller (2.5-3.0 kb) in S. cerevisiae
(data not shown). Previous studies of P. falciparum telomeres
suggested that they contained six blocks of repetitive sequences
that were designated telomere-associated repetitive elements
(TAREs 1-6).

Whole genome analysis reveals a larger (up to 120 kb), more
complex, subtelomeric repeat structure than was observed previously.
The conserved regions fall into five large subtelomeric
blocks (SBs; Fig. 2). The sequences within blocks 2, 4 and 5 include
many tandem repeats in addition to those described previously, as
well as non-repetitive regions. Subtelomeric block 1 (SB-1, equivalent
to TARE-1) contains the 7-bp telomeric repeat in a variable
number of near-exact copies. SB-2 contains several sub-blocks of
repeats of different sizes, including TAREs 2-5 and other sequences.
The beginning of SB-2 consists of about 1,000-1,300 bp of non-repetitive
sequence, followed on some chromosomes by 2.5 copies
of a 164-bp repeat. This is followed by another 300 bp of non-repetitive
sequence, and then 10 copies of a 135-bp repeat, the main
element of TARE-2. TARE-2 is followed by 200 bp of non-repetitive
sequence, and then two copies of a highly conserved 63-bp repeat.
SB-2 extends for another 6 kb that contains non-repetitive sequence
as well as other tandem repeats. Only four of the 28 telomeres are
missing SB-2, which always occurs immediately adjacent to SB-1. A
notable feature of SB-2 is the conserved order and orientation of
each repeat variant as well as the sequence homology extending
throughout the block. For almost any two chromosomes that were
examined, a consistently ordered series of unique, identical
sequences of >30 bp that are distributed across SB-2 were identified,
suggesting that SB-2 is a repeat with a complex internal
structure occurring once per telomere.

SB-3 consists of the Rep20 element, a large block of highly
variable copies of a 21-bp repeat. The tandem repeats in SB-3 occur
in a random order (Fig. 2). SB-4 has not been described previously,
although it does contain the previously described R-FA3 sequence.
SB-4 also includes a complex mix of short (<28-bp) tandem
repeats, and a 105-bp repeat that occurs once in each subtelomere.
Many telomeres contain one or more var (variant antigen) gene
exons within this block, which appear as gaps in the alignment. In
five subtelomeres, fragments of 2-4 kb from SB-4 are duplicated and
inverted. SB-5 is found in half of the subtelomeres, does not contain
tandem repeats, and extends up to 120 kb into some chromosomes.
The arrangement and composition of the subtelomeric blocks
suggest frequent recombination between the telomeres.
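
The subtelomere alignments in Fig. 2 plot exact matches that are shared by two chromosomes and occur only once in each. The sketch below illustrates that idea with fixed-length k-mers. It is a naive stand-in and not the MUMmer algorithm, which finds maximal unique matches with a suffix-tree index; it considers the forward strand only, and the function name and toy sequences are hypothetical.

```python
from collections import Counter


def unique_shared_kmers(a: str, b: str, k: int = 40) -> list[tuple[int, int]]:
    """Return (position in a, position in b) for every k-mer that occurs
    exactly once in `a` and exactly once in `b`."""

    def unique_positions(s: str) -> dict[str, int]:
        kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
        counts = Counter(kmers)
        return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}

    ua, ub = unique_positions(a), unique_positions(b)
    return sorted((ua[kmer], ub[kmer]) for kmer in ua.keys() & ub.keys())


# Toy example with a short k so the result is easy to follow.
seq_a = "ACGTTGCA" + "GGGG"
seq_b = "TT" + "ACGTTGCA"
print(unique_shared_kmers(seq_a, seq_b, k=8))
```

Plotting the returned coordinate pairs would give a dot plot in which a collinear run of points along a diagonal marks an aligned region.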

Centromeres have not been identified experimentally in malaria
parasites. However, putative centromeres were identified by comparison
of the sequences of chromosomes 2 and 3 (ref. 6). Eleven of
the 14 chromosomes contained a single region of 2-3 kb with
extremely high (A + T) content (>97%) and imperfect short
tandem repeats, features resembling the regional S. pombe centromeres;
the 3 chromosomes lacking such regions were incomplete.
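
A scan for candidate regions of this kind can be sketched as a sliding-window (A + T) calculation. The example is an assumption-laden illustration: the default window size and threshold are loosely based on the 2-3 kb and >97% figures in the text, the step size is arbitrary, the function names are hypothetical, and no check for tandem repeats is made.

```python
def at_fraction(seq: str) -> float:
    """Fraction of A and T bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq) if seq else 0.0


def at_rich_windows(seq: str, window: int = 2000, step: int = 500,
                    threshold: float = 0.97) -> list[tuple[int, int]]:
    """Return (start, end) of every window whose (A + T) fraction exceeds `threshold`."""
    return [
        (start, start + window)
        for start in range(0, len(seq) - window + 1, step)
        if at_fraction(seq[start:start + window]) > threshold
    ]


# Toy sequence (hypothetical): a 200-bp pure A/T stretch flanked by G/C sequence.
toy_chromosome = "GC" * 50 + "AT" * 100 + "GC" * 50
print(at_rich_windows(toy_chromosome, window=100, step=50))
```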

The  proteome 

Of the 5,268 predicted proteins, about 60% (3,208 hypothetical
proteins) did not have sufficient similarity to proteins in other
organisms to justify provision of functional assignments (Table 2).
This is similar to what was found previously with chromosomes 2
and 3 (refs 5, 6). Thus, almost two-thirds of the proteins appear to
be unique to this organism, a proportion much higher than
observed in other eukaryotes. This may be a reflection of the greater
evolutionary distance between Plasmodium and other eukaryotes
that have been sequenced, exacerbated by the reduction of sequence
similarity due to the (A + T) richness of the genome. Another 257
proteins (5%) had significant similarity to hypothetical proteins in
other organisms. Thirty-one per cent (1,631) of the predicted
proteins had one or more transmembrane domains, and 17.3%
(911) of the proteins possessed putative signal peptides or signal
anchors.
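
The percentages in this paragraph can be checked against the counts in Table 2. The sketch below recomputes them; the dictionary keys and the helper `pct` are illustrative names, and the counts are those given in the text and in Table 2.

```python
TOTAL_PROTEINS = 5_268

# Counts as given in the text and in Table 2.
counts = {
    "hypothetical": 3_208,
    "similar_to_hypothetical": 257,
    "transmembrane": 1_631,
    "signal_peptide": 544,
    "signal_anchor": 367,
    "non_secretory": 4_357,
}


def pct(n: int) -> float:
    """Percentage of all predicted proteins, rounded to one decimal place."""
    return round(100 * n / TOTAL_PROTEINS, 1)


# Proteins with a signal peptide or a signal anchor.
secretory = counts["signal_peptide"] + counts["signal_anchor"]

print("hypothetical proteins   :", pct(counts["hypothetical"]))
print("transmembrane domain(s) :", pct(counts["transmembrane"]))
print("signal peptide or anchor:", secretory, pct(secretory))
print("partition check         :", secretory + counts["non_secretory"] == TOTAL_PROTEINS)
```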

The Gene Ontology (GO) database is a controlled vocabulary
that  describes  the  roles  of  genes  and  gene  products  in  organisms. 
GO  terms  were  assigned  manually  to  2,134  gene  products  (40%) 


Figure 1 Schematic representation of the P. falciparum 3D7 genome.
Protein-encoding genes are indicated by open diamonds. All genes are depicted at
the same scale regardless of their size or structure. The labels indicate the name for each
gene. The rows of coloured rectangles represent, from top to bottom for each
chromosome, the high-level Gene Ontology assignment for each gene in the ‘biological
process’, ‘molecular function’, and ‘cellular component’ ontologies; the life-cycle
stage(s) at which each predicted gene product has been detected by proteomics
techniques; and Plasmodium yoelii yoelii genes that exhibit conserved sequence and
organization with genes in P. falciparum, as shown by a position effect analysis.
Rectangles surrounding clusters of P. yoelii genes indicate genes shown to be linked in the
P. y. yoelii genome. Boxes containing coloured arrowheads at the ends of each
chromosome indicate subtelomeric blocks (SBs; see text and Fig. 2).

Figure 2 Alignment of subtelomeric regions of chromosomes 1, 3, 6 and 11.
MUMmer2 alignments showing exact matches between the left subtelomeric regions of
chromosome 6 (horizontal axis) and chromosomes 11 (red), 1 (blue) and 3 (green),
illustrating the conserved synteny between all telomeres. Each point represents an exact
match of 40 bp or longer that is shared by two chromosomes and is not found anywhere
else on either chromosome. Each collinear series of points along a diagonal represents an
aligned region. SB, subtelomeric block; TARE, telomere-associated repetitive element.


and a comparison of annotation with high-level GO terms for both
S. cerevisiae and P. falciparum is shown in Fig. 3. In almost all
categories, higher values can be seen for S. cerevisiae, reflecting the
greater proportion of the genome that has been characterized
compared to P. falciparum. There are two exceptions to this pattern
that reflect processes specifically connected with the parasite life
cycle. At least 1.3% of P. falciparum genes are involved in cell-to-cell
adhesion or the invasion of host cells. As discussed below (see
‘Immune evasion’), P. falciparum has 208 genes (3.9%) known to be
involved in the evasion of the host immune system. This is reflected
in the assignment of many more gene products to the GO term
‘physiological processes’ in P. falciparum than in S. cerevisiae (Fig. 3).
The comparison with S. cerevisiae also reveals that particular


Table 2 The P. falciparum proteome

Feature                        Number   Per cent
Total predicted proteins       5,268
Hypothetical proteins          3,208    60.9
InterPro matches               2,650    52.8
Pfam matches                   1,746    33.1
Gene Ontology
  Process                      1,301    24.7
  Function                     1,244    23.6
  Component                    2,412    45.8
Targeted to apicoplast         551      10.4
Targeted to mitochondrion      246      4.7
Structural features
  Transmembrane domain(s)      1,631    31.0
  Signal peptide               544      10.3
  Signal anchor                367      7.0
  Non-secretory protein        4,357    82.7

Of the apicoplast-targeted proteins, 126 were judged on the basis of experimental evidence or the
predictions of multiple programs to be localized to the apicoplast with high confidence.
Predicted apicoplast localization for 425 other proteins is based on an analysis using only one
method and is of lower confidence. Predicted mitochondrial localization was based upon BLASTP
searches of S. cerevisiae mitochondrion-targeted proteins and TargetP and MitoProtII
predictions; 148 genes were judged to be targeted to the mitochondrion with a high or medium
confidence level, and an additional 98 genes with a lower confidence of mitochondrial targeting.
Other specialized searches used the following programs and databases: InterPro; Pfam;
Gene Ontology; transmembrane domains, TMHMM; signal peptides and signal anchors,
SignalP-2.0.


categories in P. falciparum appear to be under-represented. Sporulation
and cell budding are obvious examples (they are included in
the category ‘other cell growth and/or maintenance’), but very few
genes in P. falciparum were associated with the ‘cell organization and
biogenesis’, the ‘cell cycle’, or ‘transcription factor’ categories compared
to S. cerevisiae (Fig. 3). These differences do not necessarily
imply that fewer malaria genes are involved in these processes, but
highlight areas of malaria biology where knowledge is limited.

The  apicoplast 

Malaria parasites and other members of the phylum Apicomplexa
harbour a relict plastid, homologous to the chloroplasts of plants
and algae. The ‘apicoplast’ is essential for parasite survival,
but its exact role is unclear. The apicoplast is known to function in
the anabolic synthesis of fatty acids, isoprenoids and
haem, suggesting that one or more of these compounds
could be exported from the apicoplast, as is known to occur in
plant plastids. The apicoplast arose through a process of secondary
endosymbiosis, in which the ancestor of all apicomplexan
parasites engulfed a eukaryotic alga, and retained the algal plastid,
itself the product of a prior endosymbiotic event. The 35-kb
apicoplast genome encodes only 30 proteins, but as in mitochondria
and chloroplasts, the apicoplast proteome is supplemented by
proteins encoded in the nuclear genome and post-translationally
targeted into the organelle by the use of a bipartite targeting signal,
consisting of an amino-terminal secretory signal sequence, followed
by a plastid transit peptide.

In total, 551 nuclear-encoded proteins (~10% of the predicted
nuclear-encoded proteins) that may be targeted to the apicoplast
were identified using bioinformatic and laboratory-based
methods. Apicoplast targeting of a few proteins has been verified
by antibody localization and by the targeting of fluorescent fusion
proteins to the apicoplast in transgenic P. falciparum or Toxoplasma
gondii parasites. Some proteins may be targeted to both the
apicoplast and mitochondrion, as suggested by the observation
that the total number of tRNA ligases is inadequate for independent

Figure 3 Gene Ontology classifications. Classification of P. falciparum proteins according
to the ‘biological process’ (a) and ‘molecular function’ (b) ontologies of the Gene Ontology
system.


protein synthesis in the cytoplasm, mitochondrion and apicoplast.
In  plants,  some  proteins  lack  a  transit  peptide  but  are  targeted  to 
plastids  via  an  unknown  process.  Proteins  that  use  an  alternative 
targeting  pathway  in  P.  falciparum  would  have  escaped  detection 
with  the  methods  used. 

Nuclear-encoded apicoplast proteins include housekeeping
enzymes involved in DNA replication and repair, transcription,
translation and post-translational modifications, cofactor synthesis,
protein import, protein turnover, and specific metabolic and
transport activities. No genes for photosynthesis or light perception
are apparent, although ferredoxin and ferredoxin-NADP reductase
are present as vestiges of photosystem I, and probably serve to
recycle reducing equivalents. About 60% of the putative
apicoplast-targeted proteins are of unknown function. Several metabolic
pathways in the organelle are distinct from host pathways and
offer potential parasite-specific targets for drug therapy (see
‘Metabolism’ and ‘Transport’ sections).

Evolution 

Comparative genome analysis with other eukaryotes for which the
complete genome is available (excluding the parasite E. cuniculi)
revealed that, in terms of overall genome content, P. falciparum is
slightly more similar to Arabidopsis thaliana than to other taxa.
Although this is consistent with phylogenetic studies, it could also
be due to the presence in the P. falciparum nuclear genome of genes
derived from plastids or from the nuclear genome of the secondary
endosymbiont. Thus the apparent affinity of Plasmodium and
Arabidopsis might not reflect the true phylogenetic history of the
P. falciparum lineage. Comparative genomic analysis was also used
to identify genes apparently duplicated in the P. falciparum lineage
since it split from the lineages represented by the other completed
genomes (Supplementary Table B).

There are 237 P. falciparum proteins with strong matches to
proteins in all completed eukaryotic genomes but no matches to
proteins, even at low stringency, in any complete prokaryotic
proteome (Supplementary Table C). These proteins help to define
the differences between eukaryotes and prokaryotes. Proteins in this
list include those with roles in cytoskeleton construction and
maintenance, chromatin packaging and modification, cell cycle
regulation, intracellular signalling, transcription, translation,
replication, and many proteins of unknown function. This list overlaps
with, but is somewhat larger than, the list generated by an analysis of
the S. pombe genome. The differences are probably due in part to
the different stringencies used to identify the presence or absence of
homologues in the two studies.

A large number of nuclear-encoded genes in most eukaryotic
species trace their evolutionary origins to genes from organelles that
have been transferred to the nucleus during the course of eukaryotic
evolution. Similarity searches against other complete genomes were
used to identify P. falciparum nuclear-encoded genes that may be
derived from organellar genomes. Because similarity searches are
not an ideal method for inferring evolutionary relatedness, phylogenetic
analysis was used to gain a more accurate picture of the
evolutionary history of these genes. Out of 200 candidates examined,
60 genes were identified as being of probable mitochondrial
origin. The proteins encoded by these genes include many with
known or expected mitochondrial functions (for example, the
tricarboxylic acid (TCA) cycle, protein translation, oxidative
damage protection, the synthesis of haem, ubiquinone and pyrimidines),
as well as proteins of unknown function. Out of 300
candidates examined, 30 were identified as being of probable plastid
origin, including genes with predicted roles in transcription and
translation, protein cleavage and degradation, the synthesis of
isoprenoids and fatty acids, and those encoding four subunits of
the pyruvate dehydrogenase complex. The origin of many candidate
organelle-derived genes could not be conclusively determined, in
part due to the problems inherent in analysing genes of very high
(A + T) content. Nevertheless, it appears likely that the total
number of plastid-derived genes in P. falciparum will be significantly
lower than that in the plant A. thaliana (estimated to be over 1,000).
Phylogenetic analysis reveals that, as with the A. thaliana plastid,
many of the genes predicted to be targeted to the apicoplast are
apparently not of plastid origin. Of 333 putative apicoplast-targeted
genes for which trees were constructed, only 26 could be assigned a
probable plastid origin. In contrast, 35 were assigned a probable
mitochondrial origin and another 85 might be of mitochondrial
origin but are probably not of plastid origin (they group with
eukaryotes that have not had plastids in their history, such as
humans and fungi, but the relationship to mitochondrial ancestors
is not clear). The apparent non-plastid origin of these genes could
either be due to inaccuracies in the targeting predictions or to the
co-option of genes derived from the mitochondria or the nucleus to
function in the plastid, as has been shown to occur in some plant
species.

Metabolism 

Biochemical  studies  of  the  malaria  parasite  have  been  restricted 
primarily  to  the  intra-erythrocytic  stage  of  the  life  cycle,  owing  to 
the  difficulty  of  obtaining  suitable  quantities  of  material  from  the 
other  life-cycle  stages.  Analysis  of  the  genome  sequence  provides  a 
global  view  of  the  metabolic  potential  of  P.  falciparum  irrespective  of 
the  life-cycle  stage  (Fig.  4).  Of  the  5,268  predicted  proteins,  733 
(~14%) were identified as enzymes, of which 435 (~8%) were
assigned Enzyme Commission (EC) numbers. This is considerably
fewer than the roughly one-quarter to one-third of the genes in
bacterial and archaeal genomes that can be mapped to Kyoto
Encyclopedia of Genes and Genomes (KEGG) pathway diagrams,
or the 17% of S. cerevisiae open reading frames that can be assigned
EC numbers. This suggests either that P. falciparum has a smaller
proportion of its genome devoted to enzymes, or that enzymes are
more difficult to identify in P. falciparum by sequence similarity
methods. (This difficulty can be attributed either to the great
evolutionary distance between P. falciparum and other well-studied
organisms, or to the high (A + T) content of the genome.) A few
genes might have escaped detection because they were located in the
small regions of the genome that remain to be sequenced (Table 1).
However, many biochemical pathways could be reconstructed in
their entirety, suggesting that the similarity-searching approach was
for the most part successful, and that the relative paucity of enzymes
in P. falciparum may be related to its parasitic life-style. A similar
picture has emerged in the analysis of transporters (see 'Transport').
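
The percentages quoted in this paragraph can be checked directly from the counts given in the text. The following is a minimal sketch in Python; the variable and function names are illustrative assumptions and are not part of the original analysis.

```python
# Counts quoted in the text; the names are illustrative only.
PREDICTED_PROTEINS = 5268
ENZYMES = 733
ENZYMES_WITH_EC = 435


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


enzyme_pct = percent(ENZYMES, PREDICTED_PROTEINS)
ec_pct = percent(ENZYMES_WITH_EC, PREDICTED_PROTEINS)

print(f"enzymes: {enzyme_pct:.1f}% of predicted proteins")
print(f"with EC numbers: {ec_pct:.1f}% of predicted proteins")
```

Both values round to the approximate figures given in the text (~14% and ~8%), and the EC-assigned share is well below the 17% quoted for S. cerevisiae.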

In erythrocytic stages, P. falciparum relies principally on anaerobic
glycolysis for energy production, with regeneration of NAD+ by
conversion of pyruvate to lactate. Genes encoding all of the
enzymes necessary for a functional glycolytic pathway were
identified, including a phosphofructokinase (PFK) that has sequence
similarity to the pyrophosphate-dependent class of enzymes but
which is probably ATP-dependent on the basis of the
characterization of the homologous enzyme in Plasmodium berghei. A
second putative pyrophosphate-dependent PFK was also identified
which possessed N- and carboxy-terminal extensions that could
represent targeting sequences.

A  gene  encoding  fructose  bisphosphatase  could  not  be  detected, 
suggesting  that  gluconeogenesis  is  absent,  as  are  enzymes  for 
synthesis  of  trehalose,  glycogen  or  other  carbohydrate  stores. 
Candidate genes for all but one enzyme of the conventional pentose
phosphate pathway were found. These include a bifunctional
glucose-6-phosphate dehydrogenase/6-phosphogluconate
dehydrogenase required to generate NADPH and ribose 5-phosphate
for other biosynthetic pathways. Transaldolase appears to be
absent, but erythrose 4-phosphate required for the chorismate
pathway could probably be generated from the glycolytic
intermediates fructose 6-phosphate and glyceraldehyde 3-phosphate via
a putative transketolase (Fig. 4).

Figure 4 Overview of metabolism and transport in P. falciparum. Glucose and glycerol
provide the major carbon sources for malaria parasites. Metabolic steps are indicated by
arrows, with broken lines indicating multiple intervening steps not shown; dotted arrows
indicate incomplete, unknown or questionable pathways. Known or potential organellar
localization is shown for pathways associated with the food vacuole, mitochondrion and
apicoplast. Small white squares indicate TCA (tricarboxylic acid) cycle metabolites that
may be derived from outside the mitochondrion. Fuchsia block arrows indicate the steps
inhibited by antimalarials; grey block arrows highlight potential drug targets. Transporters
are grouped by substrate specificity: inorganic cations (green), inorganic anions
(magenta), organic nutrients (yellow), drug efflux and other (black). Arrows indicate
direction of transport for substrates (and coupling ions, where appropriate). Numbers in
parentheses indicate the presence of multiple transporter genes with similar substrate
predictions. Membrane transporters of unknown or putative subcellular localization are
shown in a generic membrane (blue bar). Abbreviations: ACP, acyl carrier protein; ALA,
aminolevulinic acid; CoA, coenzyme A; DHF, dihydrofolate; DOXP, deoxyxylulose
phosphate; FPIX2+ and FPIX3+, ferro- and ferriprotoporphyrin IX, respectively; pABA,
para-aminobenzoic acid; PEP, phosphoenolpyruvate; Pi, phosphate; PPi, pyrophosphate;
PRPP, phosphoribosyl pyrophosphate; THF, tetrahydrofolate; UQ, ubiquinone.

The  genes  necessary  for  a  complete  TCA  cycle,  including  a 
complete pyruvate dehydrogenase complex, were identified.
However, it remains unclear whether the TCA cycle is used for the full
oxidation of products of glycolysis, or whether it is used to supply
intermediates for other biosynthetic pathways. The pyruvate
dehydrogenase complex seems to be localized in the apicoplast, and the
only protein with significant similarity to aconitases has been
reported to be a cytosolic iron-response element binding protein
that did not possess aconitase activity. Also, malate dehydrogenase
appears to be cytosolic rather than mitochondrial, even though it
seems to have originated from the mitochondrial genome. Genes
encoding malate-quinone oxidoreductase and type I fumarate
dehydratase are present. Malate-quinone oxidoreductase, which is
probably targeted to the mitochondrion, may well replace malate
dehydrogenase in the TCA cycle, as it does in Helicobacter pylori. A
gene encoding phosphoenolpyruvate carboxylase (PEPC) was also
found. Like bacteria and plants, P. falciparum may cope with a drain
of TCA cycle intermediates by using phosphoenolpyruvate (PEP) to
replenish oxaloacetate (Fig. 4). This would seem to be supported by
reports of CO2-incorporating activity in asexual stage parasite
cultures. Thus, the TCA cycle appears to be unconventional in
erythrocytic stages, and may serve mainly to synthesize
succinyl-CoA, which in turn can be used in the haem biosynthesis pathway.

Genes encoding all subunits of the catalytic F1 portion of ATP
synthase, the protein that confers oligomycin sensitivity, and the
gene that encodes the proteolipid subunit c for the F0 portion of
ATP synthase, were detected in the parasite genome. The F0 a and b
subunits could not be detected, raising the question as to whether
the ATP synthase is functional. Because parts of the genome
sequence are incomplete, the presence of the a and b subunits
could not be ruled out. Erythrocytic parasites derive ATP through
glycolysis and the mitochondrial contribution to the ATP pool in
these stages appears to be minimal. It is possible that the ATP
synthase functions in the insect or sexual stages of the parasite.
However, in the absence of the F0 a and b subunits, an ATP synthase
cannot use the proton gradient.

A functional mitochondrion requires the generation of an
electrochemical gradient across the inner membrane. But the P. falciparum
genome seems to lack genes encoding components of a
conventional NADH dehydrogenase complex I. Instead, a single subunit
NADH dehydrogenase gene specifies an enzyme that can
accomplish ubiquinone reduction without proton pumping, thus
constituting a non-electrogenic step. Other dehydrogenases targeted to
the mitochondrion also serve to reduce ubiquinone in P. falciparum,
including dihydroorotate dehydrogenase, a critical enzyme in the
essential pyrimidine biosynthesis pathway. The parasite genome
contains some genes specifying ubiquinone synthesis enzymes, in
agreement with recent metabolic labelling studies. Re-oxidation of
ubiquinol is carried out by the cytochrome bc1 complex that
transfers electrons to cytochrome c, and is accompanied by proton
translocation. Apocytochrome b of this complex is encoded by the
mitochondrial genome, but the rest of the components are
encoded by nuclear genes. Ubiquinol cycling is a critical step in
mitochondrial physiology, and its selective inhibition by
hydroxynaphthoquinones is the basis for their antimalarial action. The
final step in electron transport is carried out by the proton-pumping
cytochrome c oxidase complex, of which only two subunits are
encoded in the mitochondrial DNA (mtDNA). In most eukaryotes,
subunit II of cytochrome c oxidase is encoded by a gene on the
mitochondrial genome. In P. falciparum, however, the coxII gene is
divided such that the N-terminal portion is encoded on
chromosome 13 and the C-terminal portion on chromosome 14. A similar
division of the coxII gene is also seen in the unicellular alga,
Chlamydomonas reinhardtii. An alternative oxidase that transfers
electrons directly from ubiquinol to oxygen has been seen in plants
as well as in many protists, and an earlier biochemical study suggested
its presence in P. falciparum. The genome sequence, however, fails
to reveal such an oxidase gene.

Biochemical,  genetic  and  chemotherapeutic  data  suggest  that 
malaria  and  other  apicomplexan  parasites  synthesize  chorismate 
from  erythrose  4-phosphate  and  phosphoenolpyruvate  via  the 
shikimate pathway. It was initially suggested that the pathway
was located in the apicoplast, but chorismate synthase is
phylogenetically unrelated to plastid isoforms and has subsequently
been localized to the cytosol. The genes for the preceding enzymes
in the pathway could not be identified with certainty, but a BLASTP
search with the S. cerevisiae arom polypeptide, which catalyses 5 of
the  preceding  steps,  identified  a  protein  with  a  low  level  of  similarity 
(E  value  7.9  X  10"®). 

In  many  organisms,  chorismate  is  the  pivotal  precursor  to  several 
pathways,  including  the  biosynthesis  of  aromatic  amino  acids  and 
ubiquinone.  We  found  no  evidence,  on  the  basis  of  similarity 
searches,  for  a  role  of  chorismate  in  the  synthesis  of  tryptophan, 
tyrosine  or  phenylalanine,  although  para-aminobenzoate  (pABA) 
synthase  does  have  a  high  degree  of  similarity  to  anthranilate 
(2-amino  benzoate)  synthase,  the  enzyme  catalysing  the  first  step 
in  tryptophan  synthesis  from  chorismate.  In  accordance  with  the 
supposition  that  the  malaria  parasite  obtains  all  of  its  amino  acids 
either  by  salvage  from  the  host  or  by  globin  digestion,  we  found  no 
enzymes  required  for  the  synthesis  of  other  amino  acids  with  the 
exception  of  enzymes  required  for  glycine-serine,  cysteine-alanine, 
aspartate-asparagine, proline-ornithine and glutamine-glutamate
interconversions.  In  addition  to  pABA  synthase,  all  but  one  of  the 
enzymes  (dihydroneopterin  aldolase)  required  for  de  novo  synthesis 
of  folate  from  GTP  were  identified. 

Several  studies  have  shown  that  the  erythrocytic  stages  of 
P.  falciparum  are  incapable  of  de  novo  purine  synthesis  (reviewed 
in  ref.  80).  This  statement  can  now  be  extended  to  all  life-cycle 
stages, as only adenylsuccinate lyase, one of the 10 enzymes required
to  make  inosine  monophosphate  (IMP)  from  phosphoribosyl 
pyrophosphate,  was  identified.  This  enzyme  also  plays  a  role  in 
purine  salvage  by  converting  IMP  to  AMP.  Purine  transporters  and 
enzymes  for  the  interconversion  of  purine  bases  and  nucleosides  are 
also  present.  The  parasite  can  synthesize  pyrimidines  de  novo  from 
glutamine,  bicarbonate  and  aspartate,  and  the  genes  for  each  step 
are present. Deoxyribonucleotides are formed via an aerobic
ribonucleoside diphosphate reductase, which is linked via
thioredoxin to thioredoxin reductase. Gene knockout experiments have
recently shown that thioredoxin reductase is essential for parasite
survival.

The intraerythrocytic stages of the malaria parasite use
haemoglobin from the erythrocyte cytoplasm as a food source, hydrolysing
globin to small peptides, and releasing haem that is detoxified in the
form of haemozoin. Although large amounts of haem are toxic to
the parasite, de novo haem biosynthesis has been reported and
presumably provides a mechanism by which the parasite can
segregate host-derived haem from haem required for synthesis of
its own iron-containing proteins. However, it has been unclear
whether de novo synthesis occurs using imported host enzymes
or parasite-derived enzymes. Genes encoding the first two enzymes
in the haem biosynthetic pathway, aminolevulinate synthase
and aminolevulinate dehydratase, were cloned previously, and
genes encoding every other enzyme in the pathway except for
uroporphyrinogen-III synthase were found (Fig. 4).

Haem  and  iron-sulphur  clusters  form  redox  prosthetic  groups 
for a wide range of proteins, many of which are localized to the
mitochondrion and apicoplast. The parasite genome appears to
encode enzymes required for the synthesis of these molecules. There
are two putative cysteine desulphurase genes, one of which also has
homology to selenocysteine lyase and may be targeted to the
mitochondrion, and a second which may be targeted to the
apicoplast, suggesting organelle-specific generation of elemental
sulphur to be used in Fe-S cluster proteins. The subcellular
localization  of  the  enzymes  involved  in  haem  synthesis  is  uncertain. 
Ferrochelatase  and  two  haem  lyases  are  likely  to  be  localized  in  the 
mitochondrion. 

The  role  of  the  apicoplast  in  type  II  fatty-acid  biosynthesis  was 
described previously. The genes encoding all enzymes in the
pathway have now been elucidated, except for a thioesterase
required for chain termination. No evidence was found for the
associative (type I) pathway for fatty-acid biosynthesis common to
most eukaryotes. The apicoplast also houses the machinery for
mevalonate-independent isoprenoid synthesis. Because it is not
present in mammals, the biosynthesis of isopentenyl diphosphate
from pyruvate and glyceraldehyde-3-phosphate provides several
attractive targets for chemotherapy. Three enzymes in the pathway
have been identified, including 1-deoxy-D-xylulose-5-phosphate
synthase, 1-deoxy-D-xylulose-5-phosphate reductoisomerase
and 2C-methyl-D-erythritol 2,4-cyclodiphosphate synthase.
One predicted protein was similar to the fourth enzyme,
2C-methyl-D-erythritol-4-phosphate cytidyltransferase (BLASTP
E  value  9.6  X  10”^^). 

Transport 

On the basis of genome analysis, P. falciparum possesses a very
limited repertoire of membrane transporters, particularly for
uptake of organic nutrients, compared to other sequenced
eukaryotes (Fig. 5). For instance, there are only six P. falciparum members
of the major facilitator superfamily (MFS) and one member of the
amino acid/polyamine/choline (APC) family, less than 10% of the
numbers seen in S. cerevisiae, S. pombe or Caenorhabditis elegans
(Fig. 5). The apparent lack of solute transporters in P. falciparum
correlates with the lower percentage of multispanning membrane
proteins compared with other eukaryotic organisms (Fig. 5). The
predicted transport capabilities of P. falciparum resemble those of
obligate intracellular prokaryotic parasites, which also possess a
limited complement of transporters for organic solutes.

A complete catalogue of the identified transporters is presented in
Fig. 4. In addition to the glucose/proton symporter and the water/
glycerol channel, one other probable sugar transporter and three
carboxylate transporters were identified; one or more of the latter
are probably responsible for the lactate and pyruvate/proton
symport activity of P. falciparum. Two nucleoside/nucleobase
transporters are encoded on the P. falciparum genome, one of which has
been localized to the parasite plasma membrane. No obvious
amino-acid transporters were detected, which emphasizes the
importance of haemoglobin digestion within the food vacuole as
a source of amino acids for the erythrocytic stages of the
parasite. How the insect stages of the parasite acquire amino acids
and other important nutrients is unknown, but four metabolic
uptake systems were identified whose substrate specificity could not
be predicted with confidence. The parasite may also possess novel
proteins that mediate these activities. Nine members of the
mitochondrial carrier family are present in P. falciparum, including an
ATP/ADP exchanger and a di/tri-carboxylate exchanger, probably
involved in transport of TCA cycle intermediates across the
mitochondrial membrane. Probable phosphoenolpyruvate/phosphate
and sugar phosphate/phosphate antiporters most similar to those
of plant chloroplasts were identified, suggesting that these
transporters are targeted to the apicoplast membrane. The former may
enable uptake of phosphoenolpyruvate as a precursor of fatty-acid
biosynthesis.

A more extensive set of transporters could be identified for
the transport of inorganic ions and for export of drugs and
hydrophobic compounds. Sodium/proton and calcium/proton
exchangers were identified, as well as other metal cation
transporters, including a substantial set of 16 P-type ATPases. An Nramp
divalent cation transporter was identified which may be specific for
manganese or iron. Plasmodium falciparum contains all subunits of
V-type ATPases as well as two proton translocating
pyrophosphatases, which could be used to generate a proton motive force,
possibly across the parasite plasma membrane as well as across a
vacuolar membrane. The proton pumping pyrophosphatases are
not present in mammals, and could form attractive antimalarial
targets. Only a single copy of the P. falciparum
chloroquine-resistance gene crt is present, but multiple homologues of the
multidrug resistance pump mdr1 and other predicted multidrug
transporters were identified (Fig. 3). Mutations in crt seem to have a
central role in the development of chloroquine resistance.

Figure 5 Analysis of transporters in P. falciparum. a, Comparison of the numbers of
transporters belonging to the major facilitator superfamily (MFS), ATP-binding cassette
(ABC) family, P-type ATPase family and the amino acid/polyamine/choline (APC) family in
P. falciparum and other eukaryotes. Analyses were performed as previously described.
b, Comparison of the numbers of proteins with ten or more predicted transmembrane
segments (TMS) in P. falciparum and other eukaryotes. Prediction of membrane
spanning segments was performed using TMHMM.

Plasmodium falciparum infection of erythrocytes causes a variety
of pleiotropic changes in host membrane transport. Patch clamp
analysis has described a novel broad-specificity channel activated or
inserted in the red blood cell membrane by P. falciparum infection
that allows uptake of various nutrients. If this channel is encoded
by the parasite, it is not obvious from genome analysis, because no
clear homologues of eukaryotic sodium, potassium or chloride ion
channels could be identified. This suggests that P. falciparum may
use one or more novel membrane channels for this activity.


DNA  replication,  repair  and  recombination 

DNA  repair  processes  are  involved  in  maintenance  of  genomic 
integrity  in  response  to  DNA  damaging  agents  such  as  irradiation, 
chemicals  and  oxygen  radicals,  as  well  as  errors  in  DNA  metabolism 
such as misincorporation during DNA replication. The P. falciparum
genome encodes at least some components of the major
DNA repair processes that have been found in other
eukaryotes. The core of eukaryotic nucleotide excision repair is
present (XPB/Rad25, XPG/Rad2, XPF/Rad1, XPD/Rad3, ERCC1)
although some highly conserved proteins with more accessory roles
could not be found (for example, XPA/Rad4, XPC). The same is true
for homologous recombinational repair with core proteins such as
MRE11, DMC1, Rad50 and Rad51 present but accessory proteins
such as NBS1 and XRS2 not yet found. These accessory proteins
tend  to  be  poorly  conserved  and  have  not  been  found  outside  of 
animals  or  yeast,  respectively,  and  thus  may  be  either  absent  or 
difficult  to  identify  in  P.  falciparum.  However,  it  is  interesting  that 
Archaea  possess  many  of  the  core  proteins  but  not  the  accessory 
proteins  for  these  repair  processes,  suggesting  that  many  of  the 
accessory  eukaryotic  repair  proteins  evolved  after  P.  falciparum 
diverged  from  other  eukaryotes. 

The  presence  of  MutL  and  MutS  homologues  including  possible 
orthologues of MSH2, MSH6, MLH1 and PMS1 suggests that
P. falciparum can perform post-replication mismatch repair.
Orthologues of MSH4 and MSH5, which are involved in meiotic crossing
over in other eukaryotes, are apparently absent in P. falciparum. The
repair of at least some damaged bases may be performed by the
combined action of the four base excision repair glycosylase
homologues and one of the apurinic/apyrimidinic (AP)
endonucleases (homologues of Xth and Nfo are present). Experimental
evidence suggests that this is done by the long-patch pathway.

The  presence  of  a  class  II  photolyase  homologue  is  intriguing, 
because  it  is  not  clear  whether  P.  falciparum  is  exposed  to  significant 
amounts  of  ultraviolet  irradiation  during  its  life  cycle.  It  is  possible 
that  this  protein  functions  as  a  blue-light  receptor  instead  of  a 
photolyase,  as  do  members  of  this  gene  family  in  some  organisms 
such  as  humans.  Perhaps  most  interesting  is  the  apparent  absence  of 
homologues  of  any  of  the  genes  encoding  enzymes  known  to  be 
involved  in  non-homologous  end  joining  (NHEJ)  in  eukaryotes  (for 
example, Ku70, Ku86, Ligase IV and XRCC1). NHEJ is involved in
the repair of double strand breaks induced by irradiation and
chemicals in other eukaryotes (such as yeast and humans), and is
also involved in a few cellular processes that create double strand
breaks (for example, VDJ recombination in the immune system in
humans). The role of NHEJ in repairing radiation-induced double
strand breaks varies between species. For example, in humans,
cells with defects in NHEJ are highly sensitive to γ-irradiation while
yeast mutants are not. Double strand breaks in yeast are repaired
primarily by homologous recombination. As NHEJ is involved in
regulating telomere stability in other organisms, its apparent
absence in P. falciparum may explain some of the unusual properties
of the telomeres in this species.

Secretory  pathway 

Plasmodium  falciparum  contains  genes  encoding  proteins  that  are 
important  in  protein  transport  in  other  eukaryotic  organisms,  but 
the  organelles  associated  with  a  classical  secretory  pathway  and 
protein  transport  are  difficult  to  discern  at  an  ultra-structural 
level. In order to identify additional proteins that may have a
role in protein translocation and secretion, the P. falciparum protein
database was searched with S. cerevisiae proteins with GO
assignments for involvement in protein export. We identified potential
homologues  of  important  components  of  the  signal  recognition 
particle,  the  translocon,  the  signal  peptidase  complex  and  many 
components  that  allow  vesicle  assembly,  docking  and  fusion,  such  as 
COPI  and  COPII,  clathrin,  adaptin,  v-  and  t-SNARE  and  GTP 
binding  proteins.  The  presence  of  Sec62  and  Sec63  orthologues 
raises  the  possibility  of  post-translational  translocation  of  proteins, 
as  found  in  S.  cerevisiae. 

Although P. falciparum contains many of the components
associated with a classical secretory system and vesicular transport of
proteins, the parasite secretory pathway has unusual features. The
parasite develops within a parasitophorous vacuole that is formed
during the invasion of the host cell, and the parasite modifies the
host erythrocyte by the export of parasite-encoded proteins. The
mechanism(s) by which these proteins, some of which lack signal
peptide sequences, are transported through and targeted beyond the
membrane of the parasitophorous vacuole remain unknown. But
these  mechanisms  are  of  particular  importance  because  many  of  the 
proteins  that  contribute  to  the  development  of  severe  disease  are 
exported  to  the  cytoplasm  and  plasma  membrane  of  infected 
erythrocytes. 

Attempts to resolve these observations resulted in the proposal of
a secondary secretory pathway. More recent studies suggest
export of COPII vesicle coat proteins, Sar1 and Sec31, to the
erythrocyte cytoplasm as a mechanism of inducing vesicle
formation in the host cell, thereby targeting parasite proteins beyond
the parasitophorous vacuole, a new model in cell biology. A
homologue of N-ethylmaleimide-sensitive factor (NSF), a
component of vesicular transport, has also been located to the erythrocyte
cytoplasm. The 41-2 antigen of P. falciparum, which is also found
in the erythrocyte cytoplasm and plasma membrane, is
homologous with BET3, a subunit of the S. cerevisiae transport protein
particle (TRAPP) that mediates endoplasmic reticulum to Golgi
vesicle docking and fusion. It is not clear how these proteins are
targeted to the cytoplasm, as they lack an obvious signal peptide.
Nevertheless,  the  expanded  list  of  protein-transport-associated 
genes  identified  in  the  P.  falciparum  genome  should  facilitate  the 
development  of  specific  probes  to  further  elucidate  the  intra-  and 
extracellular  compartments  of  its  protein  transport  system. 

Immune  evasion 

In  common  with  other  organisms,  highly  variable  gene  families  are 
clustered  towards  the  telomeres.  Plasmodium  falciparum  contains 
three such families termed var, rif and stevor, which code for
proteins known as P. falciparum erythrocyte membrane protein 1
(PfEMP1), repetitive interspersed family (rifin) and sub-telomeric
variable open reading frame (stevor), respectively. The 3D7
genome  contains  59  var,  149  rif  and  28  stevor  genes,  but  for  each 
family  there  are  also  a  number  of  pseudogenes  and  gene  truncations 
present. 

The  var  genes  code  for  proteins  which  are  exported  to  the  surface 
of  infected  red  blood  cells  where  they  mediate  adherence  to 
host endothelial receptors, resulting in the sequestration of
infected cells in a variety of organs. These and other adherence
properties are important virulence factors that contribute to
the development of severe disease. Rifins, products of the rif genes,
are also expressed on the surface of infected red cells and undergo
antigenic variation. Proteins encoded by stevor genes show
sequence similarity to rifins, but they are less polymorphic than
the rifins. The function of rifins and stevors is unknown. PfEMP1
proteins are targets of the host protective antibody response, but
transcriptional  switching  between  var  genes  permits  antigenic 
variation  and  a  means  of  immune  evasion,  facilitating  chronic 
infection  and  transmission.  Products  of  the  var  gene  family  are 
thus  central  to  the  pathogenesis  of  malaria  and  to  the  induction  of 
protective  immunity. 

Figure 6 shows the genome-wide arrangement of these multigene
families. In the 24 chromosomal ends that have a var gene as the first
transcriptional unit, there are three basic types of gene arrangement.
Eight have the general pattern var-rif var +/- (rif/stevor), ten can
be described as var-(rif/stevor), three have a var gene alone and two
have two or more adjacent var genes. This telomeric organization is
consistent with exchange between chromosome ends, although the
extent of this re-assortment may be limited by the varied gene
combinations. The var, rif and stevor genes consist of two exons. The
first var exon is between 3.5 and 9.0 kb in length, polymorphic and
encodes an extracellular region of the protein. The second exon is
between 1.0 and 1.5 kb, and encodes a conserved cytoplasmic tail
that contains acidic amino-acid residues (ATS; 'acidic terminal
sequence'). The first rif and stevor exons are about 50-75 bp in
length, and encode a putative signal sequence while the second exon
is about 1 kb in length, with the rif exon being on average slightly
larger than that for stevor. The rifin sequences fall into two major
subgroups determined by the presence or absence of a consensus
peptide sequence, KEL (X15) IPTCVCR, approximately 100 amino
acids from the N terminus. The var genes are made up of
three recognizable domains known as 'Duffy binding like'
(DBL), 'cysteine rich interdomain region' (CIDR) and 'constant2'.
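
The rifin subgrouping described above depends on whether a protein carries the consensus KEL (X15) IPTCVCR. A minimal Python sketch of such a test follows. The function name, the residue window standing in for "approximately 100 amino acids from the N terminus", and the toy sequences are all assumptions for illustration; this is not the method used in the original analysis.

```python
import re

# Consensus from the text: KEL, then any 15 residues (X15), then IPTCVCR.
RIFIN_MOTIF = re.compile(r"KEL[A-Z]{15}IPTCVCR")


def has_rifin_motif(protein: str, window: tuple = (50, 150)) -> bool:
    """Return True if the consensus starts within `window` residues
    (0-based, inclusive) of the N terminus.

    The default window is an assumption standing in for the text's
    "approximately 100 amino acids from the N terminus".
    """
    for match in RIFIN_MOTIF.finditer(protein.upper()):
        if window[0] <= match.start() <= window[1]:
            return True
    return False


# Synthetic toy sequences (not real rifins), used only to exercise the function.
with_motif = "M" + "A" * 99 + "KEL" + "G" * 15 + "IPTCVCR" + "A" * 50
without_motif = "M" + "A" * 200
motif_too_early = "MAAAA" + "KEL" + "G" * 15 + "IPTCVCR" + "A" * 200
```

Sequences for which the function returns True would fall into one subgroup and the remainder into the other; a real analysis would work from aligned sequences and would probably tolerate mismatches in the consensus.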

Alignment of sequences existing before the P. falciparum
genome project had placed each of these domains into a number of
sub-classes; α to ε for DBL domains, and α to γ for CIDR domains.
Despite these recognizable signatures, there is a low level of sequence
similarity even between domains of the same sub-type. Alignment
and tree construction of the DBL domains identified here showed
that a small number did not fit well into existing categories, and
have been termed DBL-X. Similar analysis of all 3D7 CIDR
sequences showed that with these data they were best described as
CIDRα or CIDR non-α, as distinct tree branches for the other
domain types were not observed. In terms of domain type and
order, 16 types of var gene sequences were identified in this study.

Type 1 var genes, consisting of DBLα, CIDRα, DBLδ and CIDR
non-α followed by the ATS, are the most common structures, with
38 genes in this category (Fig. 6b). A total of 58 var genes commence
with a DBLα domain; in 51 cases this is followed by CIDRα, and
in 46 var genes the last domain of the first exon is CIDR non-α. Four
var genes are atypical, with the first exon consisting solely of DBL
domains (type 3 and type 13). There is non-randomness in the
ordering and pairing of DBL and CIDR sub-domains, suggesting
either that some (for example, DBLδ-CIDR non-α and DBLβ-C2;
Table 3) should be considered as functional-structural
combinations, or that recombination in these areas is not favoured,
thereby preserving the arrangement. Eighteen of the 24 telomeric
proximal var genes are of type 1. With two exceptions, type 4 on
chromosome 7 and type 9 on chromosome 11, all of the telomeric
var genes are transcribed towards the centromere. The inverted
position of the two var genes may hinder homologous recombination
at these loci in telomeric clusters that are formed during asexual
multiplication. A further 12 var genes are located near to
telomeres, with the remaining var genes forming internal clusters
on chromosomes 4, 7, 8 and 12 and a single internal gene being
located on chromosome 6.

Figure 6 Organization of multi-gene families in P. falciparum. a, Telomeric regions of all
chromosomes showing the relative positions of members of the multi-gene families: rif
(blue), stevor (yellow) and var (colour coded as indicated; see b and c). Grey boxes
represent pseudogenes or gene fragments of any of these families. The left telomere is
shown above the right. Scale: ~0.6 mm = 1 kb. b, c, var gene domain structure. var
genes contain three domain types: DBL, of which there are six sequence classes; CIDR, of
which there are two sequence classes; and conserved 2 (C2) domains (see text). The
relative order of the domains in each gene is indicated (c). var genes with the same
domain types in the same order have been colour coded as an identical class and given an
arbitrary number for their type (b), with the total number of members of each class in the
genome of P. falciparum clone 3D7. d, Internal multi-gene family clusters. Key as in a.

Alignment of sequences 1.5 kb upstream of all of the var genes
revealed three classes of sequences, upsA, upsB and upsC (of which
there are 11, 35 and 13 members, respectively) that show preferential
association with different var genes. Thus, upsB is associated
with 22 out of 24 telomeric var genes; upsA is found with the two
remaining telomeric var genes that are transcribed towards the
telomere and with most telomere-associated var genes (9 out of 12),
which also point towards the telomere. All 13 upsC sequences are
associated with internal var clusters. Nearly all the telomeric var
genes have an (A + T)-rich region approximately 2 kb upstream
characterized by a number of poly(A) tracts as well as one or more
copies of the consensus GGATCTAG. An analysis of the regions
1.0 kb downstream of var genes shows three sequence families, with
members of one family being associated primarily with var genes
next to the telomeric repeats. The intron sequences within the var
genes have been associated with locus-specific silencing. They vary
in length from 170 to ~1,200 bp and are ~89% A/T. On the coding
strand, at the 5′ end the non-A/T bases are mainly G residues, with
70% of sequences having the consensus TGTTTGGATATATA. The
central regions are highly A-rich, and contain a number of
semi-conserved motifs. The 3′ region is comparably rich in C, with one or
more copies in most genes of the sequence (TA)n CCCATAACTACA.
The 3′ end has an extended and atypical splice consensus of
ACANATATAGTTA(T)n TAG. The rif and stevor
genes also have distinguishable upstream sequences, but a proportion
of rif genes have the stevor type of 5′ sequence. Because the
majority of telomeric var genes share a similar structure and 5′ and
3′ sequences, they may form a unique group in terms of regulation
of gene expression.
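The base-composition and consensus-motif observations above can be checked on any sequence with a few lines of code. The following is a minimal sketch only; the function names and the toy sequence are hypothetical and are not taken from the annotation pipeline used in this work.

```python
import re


def at_fraction(seq: str) -> float:
    """Fraction of A and T bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)


def motif_positions(seq: str, motif: str) -> list:
    """0-based start positions of every (possibly overlapping) motif match."""
    # A lookahead matches without consuming characters, so overlapping
    # occurrences are all reported.
    pattern = "(?=" + re.escape(motif.upper()) + ")"
    return [m.start() for m in re.finditer(pattern, seq.upper())]


# Toy upstream region (invented for illustration) carrying two copies of
# the GGATCTAG consensus mentioned in the text.
toy_upstream = "ATATATGGATCTAGAAAAAATTTTGGATCTAGAT"
print(motif_positions(toy_upstream, "GGATCTAG"))
print(round(at_fraction(toy_upstream), 2))
```

Applied to real intron or upstream sequences, the same two functions would report A/T content and the positions of consensus elements such as TGTTTGGATATATA.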

The most conserved var gene previously identified, which mediates
adherence to chondroitin sulphate A in the placenta, is
incomplete in 3D7 because of deletion of part of exon 1 and all of
exon 2. This gene is located on the right telomere of chromosome 5
(Fig. 6). The majority of var genes sequenced previously had been
identified because they mediated adhesion to particular receptors, and
most of them had more than four domains in exon 1. The fact that
type 1 var genes containing only four domains predominate in the 3D7
genome suggests that previous analyses had been based on a highly
biased sample. The significance of this in terms of the function of
type 1 var genes remains to be determined.

Immune-evasion mechanisms such as clonal antigenic variation
of parasite-derived red cell surface proteins (PfEMP1s, rifins) and
modulation of dendritic cell function have been documented in
P. falciparum. A putative homologue of human cytokine
macrophage migration inhibitory factor (MIF) was identified in
P. falciparum. In vertebrates, MIFs have been shown to function as
immuno-modulators and as growth factors, and in the nematode
Brugia malayi, recombinant MIF modulated macrophage migration
and promoted parasite survival. An MIF-type protein in
P. falciparum may contribute to the parasite's ability to modulate
the immune response by molecular mimicry or participate in other
host-parasite interactions.

Implications for vaccine development

An effective malaria vaccine must induce protective immune
responses equivalent to, or better than, those provided by naturally
acquired immunity or immunization with attenuated sporozoites.
To date, about 30 P. falciparum antigens that were
identified via conventional techniques are being evaluated for use
in vaccines, and several have been tested in clinical trials. Partial
protection with one vaccine has recently been attained in a field
setting. The present genome sequence will stimulate vaccine
development by the identification of hundreds of potential antigens
that could be scanned for desired properties such as surface
expression or limited antigenic diversity. This could be combined
with data on stage-specific expression obtained by microarray and
proteomics analyses to identify potential antigens that are
expressed in one or more stages of the life cycle. However,
high-throughput immunological assays to identify novel candidate
vaccine antigens that are the targets of protective humoral and
cellular immune responses in humans need to be developed if the
genome sequence is to have an impact on vaccine development. In
addition, new methods for maximizing the magnitude, quality and
longevity of protective immune responses will be required in order
to produce effective malaria vaccines.

Table 3 Domains of PfEMP1 proteins in P. falciparum

Domain type           Number of domains
DBLα                  58
DBLβ-C2               18
DBLγ                  13
DBLδ                  44
DBLε                  13
DBL-X                 13
CIDRα                 51
CIDR non-α            54

Preferred pairings    Frequency
DBLα-CIDRα            51/58
DBLβ-C2               18/18
DBLδ-CIDR non-α       44/44
CIDRα-DBLδ            39/51
CIDRα-DBLβ            10/51
DBLβ-C2-DBLγ          10/18
DBLγ-DBL-X            8/13

Top, the total number of each DBL or CIDR domain type in intact var genes within the P. falciparum
3D7 genome. Bottom, the frequencies of the most common individual domain pairings found
within intact var genes. The denominator refers to the total number of the first-named domains in
the genome, and the numerator to the number of second-named domains found
adjacent. See text for discussion of domain types.
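The pairing frequencies in Table 3 are easier to compare as percentages. The short sketch below simply converts the published numerator/denominator pairs; the dictionary layout and function name are illustrative choices, not part of the original analysis.

```python
# Numerator/denominator pairs copied from the lower half of Table 3.
PAIRINGS = {
    "DBLα-CIDRα": (51, 58),
    "DBLβ-C2": (18, 18),
    "DBLδ-CIDR non-α": (44, 44),
    "CIDRα-DBLδ": (39, 51),
    "CIDRα-DBLβ": (10, 51),
    "DBLβ-C2-DBLγ": (10, 18),
    "DBLγ-DBL-X": (8, 13),
}


def pairing_percentages(pairings):
    """Return each pairing frequency as a percentage, to one decimal place."""
    return {
        name: round(100 * adjacent / total, 1)
        for name, (adjacent, total) in pairings.items()
    }


for name, pct in pairing_percentages(PAIRINGS).items():
    print(f"{name:<18}{pct:>6.1f}%")
```

Pairings at 100% (DBLβ-C2 and DBLδ-CIDR non-α) are the ones the text singles out as possible functional-structural combinations.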

Concluding  remarks 

The P. falciparum, Anopheles gambiae and Homo sapiens genome
sequences have been completed in the past two years, and represent
new starting points in the centuries-long search for solutions to the
malaria problem. For the first time, a wealth of information is
available for all three organisms that comprise the life cycle of the
malaria parasite, providing abundant opportunities for the study of
each species and their complex interactions that result in disease.
The rapid pace of improvements in sequencing technology and the
declining costs of sequencing have made it possible to begin genome
sequencing efforts for Plasmodium vivax, the second major human
malaria parasite, several malaria parasites of animals, and for many
related parasites such as Theileria and Toxoplasma. These will be
extremely useful for comparative purposes. Last, this technology
will enable sampling of parasite, vector and host genomes in the
field, providing information to support the development, deployment
and monitoring of malaria control methods.

In the short term, however, the genome sequences alone provide
little relief to those suffering from malaria. The work reported here
and elsewhere needs to be accompanied by larger efforts to develop
new methods of control, including new drugs and vaccines,
improved diagnostics and effective vector control techniques.
Much remains to be done. Clearly, research and investments to
develop and implement new control measures are needed desperately
if the social and economic impacts of malaria are to be
relieved. The increased attention given to malaria (and to other
infectious diseases affecting tropical countries) at the highest levels
of government, and the initiation of programmes such as the Global
Fund to Fight AIDS, Tuberculosis and Malaria, the Multilateral
Initiative on Malaria in Africa, the Medicines for Malaria Venture,
and the Roll Back Malaria campaign, provide some hope
of progress in this area. It is our hope and expectation that
researchers around the globe will use the information and biological
insights provided by complete genome sequences to accelerate the
search for solutions to diseases affecting the most vulnerable of the
world's population. □

Methods 

Sequencing,  gap  closure  and  annotation 

The techniques used at each of the three participating centres for sequencing, closure and
annotation are described in the accompanying Letters. To ensure that each centre's
annotation procedures produced roughly equivalent results, the Wellcome Trust Sanger
Institute ('Sanger') and the Institute for Genomic Research ('TIGR') annotated the same
100-kb segment of chromosome 14. The number of genes predicted in this sequence by the
two centres was 22 and 23, the discrepancy being due to the merging of two single genes by
one centre. Of the 74 exons predicted by the two centres, 50 (68%) were identical, 9 (12%)
overlapped, 6 (8%) overlapped and shared one boundary, and the remainder were
predicted by one centre but not the other. Thus 88% of the exons predicted by the two
centres in the 100-kb fragment were identical or overlapped.
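The exon-level agreement figures can be reproduced directly from the counts given above. The sketch below is illustrative; the category labels are shorthand introduced here, and the count for exons predicted by only one centre is derived as the remainder of the 74.

```python
TOTAL_EXONS = 74
counts = {
    "identical": 50,
    "overlapped": 9,
    "overlapped_shared_boundary": 6,
}
# Remainder: exons predicted by one centre but not the other.
counts["one_centre_only"] = TOTAL_EXONS - sum(counts.values())

# Whole-number percentages, as reported in the text.
percentages = {k: round(100 * v / TOTAL_EXONS) for k, v in counts.items()}

agreeing = TOTAL_EXONS - counts["one_centre_only"]
agreement_pct = round(100 * agreeing / TOTAL_EXONS)

print(percentages)
print(f"identical or overlapping: {agreement_pct}%")
```

The three agreeing categories sum to 65 of 74 exons, which is where the 88% overall figure comes from.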

Finished sequence data and annotation were transferred in XML (extensible markup
language) format from Sanger and the Stanford Genome Technology Center to TIGR, and
made available to co-authors over the internet. Genes on finished chromosomes were
assigned systematic names according to the scheme described previously. Genes on
unfinished chromosomes were given temporary identifiers.

Analysis  of  subtelomeric  regions 

Subtelomeric regions were analysed by the alignment of all of the chromosomes to each
other using MUMmer2 with a minimum exact match length ranging from 30 to 50 bp.
Tandem repeats were identified by extracting a 90-kb region from the ends of all
chromosomes and using Tandem Repeat Finder with the following parameter settings:
match = 2, mismatch = 7, indel = 7, pm = 75, pi = 10, minscore = 100,
maxperiod = 500. Detailed pairwise alignments of internal telomeric blocks were
computed with the ssearch program from the Fasta3 package.
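The tandem-repeat step above can be sketched as two small helpers: one that cuts the terminal 90 kb from each chromosome end, and one that assembles a command line with the stated parameter values. This is a hedged illustration only. The executable name `trf`, the positional argument order (match, mismatch, indel, pm, pi, minscore, maxperiod) and the file name are assumptions that should be checked against the installed version of the program; the command is built here but not run.

```python
def chromosome_ends(seq: str, length: int = 90_000):
    """Return the (left, right) terminal regions of a chromosome sequence.

    If the sequence is shorter than `length`, both regions are the whole
    sequence.
    """
    return seq[:length], seq[-length:]


# Parameter values as listed in the text, in the order they are given there.
TRF_PARAMS = {
    "match": 2,
    "mismatch": 7,
    "indel": 7,
    "pm": 75,
    "pi": 10,
    "minscore": 100,
    "maxperiod": 500,
}


def trf_command(fasta_path: str) -> list:
    """Build an argument list for a tandem-repeat run (assumed CLI layout)."""
    return ["trf", fasta_path] + [str(v) for v in TRF_PARAMS.values()]


left, right = chromosome_ends("A" * 10 + "C" * 5 + "G" * 10, length=10)
print(left, right)
print(" ".join(trf_command("chromosome_ends.fa")))  # hypothetical file name
```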

Evolutionary  analyses 

Plasmodium  falciparum  proteins  were  searched  against  a  database  of  proteins  from  all 
complete  genomes  as  well  as  from  a  set  of  organelle,  plasmid  and  viral  genomes.  Putative 
recently duplicated genes were identified as those encoding proteins with better BLASTP
matches  (based  on  E  value  with  a  10“^^  cutoff)  to  other  proteins  in  P.  falciparum  than  to 
proteins  in  any  other  species.  Proteins  of  possible  organellar  descent  were  identified  as 
those  for  which  one  of  the  top  six  prokaryotic  matches  (based  on  E  value)  was  to  either  a 
protein  encoded  by  an  organelle  genome  or  by  a  species  related  to  the  organelle  ancestors 
(members of the Rickettsia subgroup of the α-Proteobacteria or cyanobacteria). Because
BLAST  matches  are  not  an  ideal  method  of  inferring  evolutionary  history,  phylogenetic 
analysis  was  conducted  for  all  these  proteins.  For  phylogenetic  analysis,  all  homologues  of 
each protein were identified by BLASTP searches of complete genomes and of a non-
redundant  protein  database.  Sequences  were  aligned  using  CLUSTALW,  and  phylogenetic 
trees  were  inferred  using  the  neighbour-joining  algorithms  of  CLUSTALW  and  PHYLIP. 
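The BLASTP-based criterion for putative recently duplicated genes described above can be expressed as a simple filter over a hit table. The sketch below is an illustration under stated assumptions: the tuple layout (query, subject, subject species, E value), the function name and the toy hits are invented here, and the E-value cutoff is left as a required argument rather than hard-coded.

```python
def recently_duplicated(hits, own_species, cutoff):
    """Return queries whose best hit within their own species beats every
    hit to any other species.

    hits: iterable of (query, subject, subject_species, evalue) tuples.
    Self-hits and hits with an E value above `cutoff` are ignored.
    """
    best_own, best_other = {}, {}
    for query, subject, species, evalue in hits:
        if subject == query or evalue > cutoff:
            continue
        table = best_own if species == own_species else best_other
        if evalue < table.get(query, float("inf")):
            table[query] = evalue
    return sorted(
        q for q, e in best_own.items() if e < best_other.get(q, float("inf"))
    )


# Toy hit table with invented identifiers and E values.
toy_hits = [
    ("g1", "g2", "P. falciparum", 1e-50),
    ("g1", "h1", "H. sapiens", 1e-20),
    ("g2", "g1", "P. falciparum", 1e-50),
    ("g2", "y1", "S. cerevisiae", 1e-60),
    ("g3", "g3", "P. falciparum", 0.0),  # self-hit, ignored
    ("g3", "a1", "A. thaliana", 1e-30),
]
print(recently_duplicated(toy_hits, "P. falciparum", cutoff=1e-10))
```

As the text notes, such best-hit comparisons are only a first screen; candidates identified this way were followed up by phylogenetic analysis.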
For  comparative  analysis  of  eukaryotes,  the  proteomes  of  all  eukaryotes  for  which 
complete  genomes  are  available  (except  the  highly  reduced  E.  cuniculi)  were  searched 
against  each  other.  The  proportion  of  proteins  in  each  eukaryotic  species  that  had  a 
BLASTP  match  in  each  of  the  other  eukaryotic  species  was  determined,  and  used  to  infer  a 
‘whole-genome  tree’  using  the  neighbour-joining  algorithm.  Possible  eukaryotic 
conserved  and  specific  proteins  were  identified  as  those  with  matches  to  all  the  complete 
eukaryotic  genomes  (10“^°  E-value  cutoff)  but  without  matches  to  any  complete 
prokaryotic genome (lO”^^ cutoff).

here the nucleotide sequences of chromosomes 10, 11 and 14, and
a re-analysis of the chromosome 2 sequence. These chromosomes
represent about 35% of the 23-megabase P. falciparum
genome.

P. falciparum chromosomes were resolved on preparative pulsed
field gels, and used to prepare shotgun libraries of 1-2-kilobase (kb)
DNA fragments in plasmid vectors. Sequences of randomly selected
clones were assembled, and gaps were closed using primer walking
on plasmid templates or polymerase chain reaction (PCR) products.
The cross-contamination of the chromosomal libraries with
sequences from other chromosomes (up to 25%) and the high
(A + T) content (80.6%) of P. falciparum DNA caused extreme
difficulties in the gap closure process. Intergenic regions and introns
frequently contained long runs of up to 50 consecutive A or T
residues that were difficult to clone and sequence. The high (A + T)
content of the chromosomes also prevented the construction of
large insert libraries that could be used to construct scaffolds of
ordered and oriented contiguous DNA sequences (contigs) during
assembly. Similar but more severe problems were reported in the
sequencing of the (A + T)-rich chromosome 2 of the slime mould
Dictyostelium discoideum, illustrating the need to develop better
methods for the cloning and sequencing of very (A + T)-rich
genomes. The reported sequences contain three or four short gaps
(<2 kb) in each chromosome. Contigs comprising these chromosomes
were joined end-to-end before annotation. Efforts to
close the remaining gaps will continue.

Examination of the sequences of chromosomes 2, 10, 11 and 14
revealed that the structure of these chromosomes was similar to that
of the other chromosomes. All contained the 97-99% (A + T)
putative centromeric sequences reported previously. Conserved
subtelomeric sequences were observed in chromosomes 2, 10 and
11, but most of these elements had been deleted from both ends of
chromosome 14. The termini of chromosome 14 consisted of
telomeric hexamer repeats fused directly to truncated var (variant
antigen) genes. Deletions of this type are thought to be due to
chromosome breakage and healing events that occur during in vitro
cultivation of the parasite.

Annotation procedures have improved since the publication of
the P. falciparum chromosome 2 sequence. A gene finding program,
phat (pretty handy annotation tool), was developed, supplementing
the GlimmerM program used previously. In this work, GlimmerM
and phat were retrained on a larger training set of well-characterized
genes, complementary DNAs (cDNAs) and products
of PCR with reverse transcription (RT-PCR) (total length 540 kb)
than was used in the earlier work. A program called Combiner was
used to evaluate the GlimmerM and phat predictions, as well as the
results of searches against nucleotide and protein databases, to
construct consensus gene models. To assess the effect of these
modifications, chromosome 2 was re-annotated and the results
were compared with the previous annotation.

Table 1 Summary statistics

Feature                                       Whole genome   Chromosome 2         Chromosome 10   Chromosome 11   Chromosome 14
The genome
  Size (bp)                                   22,853,764     947,102              1,694,445       2,035,250       3,291,006
  No. of gaps                                 93             0                    4               3               3
  Coverage*                                   14.5           11.1                 15.6            11.3            9.2
  (G + C) content (%)                         19.4           19.7                 19.7            19.0            18.4
  No. of genes                                5,268          223 (209)            403             492             769
  Mean gene length (bp)†                      2,283.3        2,079.1 (2,105.1)    2,085.8         2,127.7         2,315.1
  Gene density (bp per gene)                  4,338.2        4,247.1 (4,531.6)    4,204.6         4,136.7         4,279.6
  Percent coding                              52.6           49.0 (46.5)          49.6            51.4            54.1
  Genes with introns (%)                      53.9           57.0 (43.1)          51.4            50.4            49.9
  Genes with ESTs (%)                         49.1           46.2                 48.1            48.4            46.9
  Gene products detected by proteomics‡ (%)   51.8           43.5                 49.1            51.0            52.1
Exons
  Number                                      12,674         510 (353)            892             1,094           1,757
  Mean no. per gene                           2.4            2.3 (1.7)            2.2             2.2             2.3
  (G + C) content (%)                         23.7           24.4 (24.3)          24.5            23.5            22.8
  Mean length (bp)                            949.1          909.1 (1,246.3)      942.3           956.9           1,013.3
  Total length (bp)                           12,028,350     463,647 (439,944)    840,576         1,046,814       1,780,305
Introns
  Number                                      7,406          287 (144)            489             602             988
  (G + C) content (%)                         13.5           13.4 (13.4)          13.6            13.7            13.5
  Mean length (bp)                            178.7          202.4 (208.4)        234.5           189.4           185.5
  Total length (bp)                           1,323,509      58,080 (30,006)      114,676         114,012         183,240
Intergenic regions
  (G + C) content (%)                         13.6           13.5 (14.1)          13.6            14.1            13.2
  Mean length (bp)                            1,693.9        1,702.3 (2,063.2)    1,678.5         1,768.5         1,717.2
RNAs
  No. of tRNA genes                           43             1                    0               2               2
  No. of 5S rRNA genes                        3              0                    0               0               3
  No. of 5.8S, 18S and 28S rRNA units         7              0                    0               1               0
The proteome
  Total predicted proteins                    5,268          223                  403             492             769
  Hypothetical proteins§                      3,208          121                  265             339             485
  InterPro matches                            2,650          116                  210             283             455
  Pfam matches                                1,746          77                   133             184             275
  Gene Ontology
    Process                                   1,301          63                   89              110             168
    Function                                  1,244          54                   74              95              174
    Component                                 2,412          120                  181             220             308
  Targeted to apicoplast                      551            28                   36              52              73
  Targeted to mitochondrion                   246            10                   13              17              33
  Structural features
    Transmembrane domain(s)                   1,631          87                   133             141             202
    Signal peptide                            544            28                   41              52              63
    Signal anchor                             367            19                   32              31              51

Numbers in parentheses under chromosome 2 indicate values obtained in the previous annotation. Specialized searches used the following programs and databases: InterPro, Pfam and Gene
Ontology. Predictions of apicoplast and mitochondrial targeting were performed using TargetP and MitoProtII; transmembrane domains, TMHMM; and signal peptides and signal anchors,
SignalP-2.0 (ref. 23).
* Average number of sequence reads per nucleotide. EST, expressed sequence tag.
† Excluding introns.
‡ Percent of proteins detected in parasite extracts by two independent proteomic analyses.
§ Hypothetical proteins are proteins with insufficient similarity to characterized proteins in other organisms to justify provision of functional assignments.
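Several rows of Table 1 appear to be simple ratios of other rows, which gives a quick consistency check on the table. The sketch below recomputes three of them from the whole-genome and chromosome 2 columns; the assumption that gene density is size divided by gene count, and that percent coding is total exon length divided by size, is inferred from the numbers rather than stated in the text.

```python
# Values copied from Table 1 (new annotation only).
TABLE_1 = {
    "whole_genome": {"size_bp": 22_853_764, "genes": 5_268,
                     "exons": 12_674, "exon_total_bp": 12_028_350},
    "chromosome_2": {"size_bp": 947_102, "genes": 223,
                     "exons": 510, "exon_total_bp": 463_647},
}


def derived_stats(col):
    """Recompute ratio-type rows of Table 1 from the underlying counts."""
    return {
        "gene_density_bp_per_gene": round(col["size_bp"] / col["genes"], 1),
        "percent_coding": round(100 * col["exon_total_bp"] / col["size_bp"], 1),
        "mean_exons_per_gene": round(col["exons"] / col["genes"], 1),
    }


for name, column in TABLE_1.items():
    print(name, derived_stats(column))
```

Under these assumptions the recomputed values agree with the published gene density, percent coding and mean exons per gene for both columns.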

Application  of  these  automated  annotation  procedures  and 
manual  curation  of  the  resulting  gene  models  for  chromosome  2 
produced  223  gene  models.  The  revised  procedures  detected  21 
genes  not  predicted  previously,  and  13  of  the  existing  chromosome 
2  models  collapsed  into  six  models  in  the  new  annotation.  Of  the  21 
new  gene  models,  all  but  one  had  no  significant  similarity  to 
proteins  in  a  non-redundant  amino-acid  database.  However,  at 
least  a  portion  of  each  of  the  21  gene  models  had  been  predicted 
independently  by  both  GlimmerM  and  phat,  suggesting  that  many 
of  these  models  were  likely  to  represent  coding  sequences.  On 
the  other  hand,  five  of  the  new  gene  models  encoded  proteins  less 
than  100  amino  acids  in  length,  and  may  be  less  likely  to  encode 
proteins. 

Another  major  difference  was  the  detection  of  additional  small 
exons.  In  the  earlier  annotation  of  chromosome  2,  the  209  predicted 
genes  contained  353  exons,  or  an  average  of  1.7  exons  per  gene.  The 
revised  procedures  reported  here  revealed  510  exons,  or  2.3  exons 
per  gene;  60%  of  the  new  exons  were  predicted  to  be  additions  to  the 
gene  models  reported  previously.  Most  cases  involved  the  addition 
of  one  or  two  exons  per  gene.  In  three  notable  cases,  however,  7  to  12 
small  exons  were  added  to  the  earlier  gene  models,  and  almost  all  of 
the  new  exons  had  been  predicted  by  both  of  the  gene  finding 
programs. Overall, use of the revised annotation procedures
resulted in the detection of additional genes and many small
exons, which is reflected in the higher gene density and shorter
mean exon length in the newly annotated chromosome 2 sequence
compared with the previous annotation (Table 1). Despite these
improvements in software and training sets, gene finding in
P. falciparum remains challenging, and the gene structures presented
here should be regarded as preliminary until confirmed by
sequence information obtained from cDNAs or RT-PCR experiments.
Accurate prediction of the 5' ends of genes is particularly
difficult. Generation of larger training sets, including additional
expressed sequence tags (ESTs) and full-length cDNAs, would
greatly improve the sensitivity and accuracy of gene predictions.
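The gene and exon counts for the two chromosome 2 annotations can be reconciled with a little bookkeeping. The sketch below assumes that the 21 new genes and the collapse of 13 models into six are the only changes in gene number between the two annotations; that assumption is not stated explicitly in the text, but the arithmetic is consistent with it.

```python
previous_genes, previous_exons = 209, 353
revised_exons = 510

new_genes = 21                    # models not predicted previously
merged_from, merged_into = 13, 6  # 13 old models collapsed into six

# Net change: 21 gained, 7 lost through merging.
revised_genes = previous_genes + new_genes - (merged_from - merged_into)

previous_exons_per_gene = round(previous_exons / previous_genes, 1)
revised_exons_per_gene = round(revised_exons / revised_genes, 1)

print(revised_genes, previous_exons_per_gene, revised_exons_per_gene)
```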

These annotation procedures were also applied to the analysis of
chromosomes 10, 11 and 14 (Table 1; maps of these chromosomes
are available as Supplementary Information). The 10 short gaps in
the chromosomes should not have interfered with the gene predictions;
only the genes adjacent to the gaps might have been affected.
All three chromosomes were similar in terms of gene density, coding
percentage and other parameters. A complete description of the
parasite genome is contained in the accompanying Article.

Annotation of chromosomes 10, 11 and 14 revealed four proteins
with sequence similarity to SR proteins, a family of conserved
splicing factors that contain RNA-binding domains and a protein
interaction domain rich in Ser and Arg residues (SR domain;
PF10_0047, PF10_0217, PF11_0200, PF14_0656). Three additional
putative SR proteins were identified on chromosomes 5 and 13
(PFE0160C, PFE0865C, MAL13P1.120). SR proteins are thought to
bind to exonic splicing enhancers (ESEs), short (6-9 bp) sequences
within exons that assist in the recognition of nearby splice sites, and
to interact with components of the spliceosome. ESEs have
previously been characterized only in multicellular organisms. To
determine whether P. falciparum may use ESEs as part of its splicing
machinery, a Gibbs sampling algorithm for motif detection was
applied to a set of P. falciparum exons to detect any
ESEs. The exons were extracted from the set of well-
characterized genes used to train the GlimmerM gene finder.


Regions of 50 bp were selected from both ends of the
internal exons and divided into two different data sets, representing
the exon regions adjacent to the 5′ and 3′ splice sites. At least 10
runs of the Gibbs sampler were performed for each data set in order
to identify the most probable motif with a length of 5-9 nucleotides.
The motif with the highest maximum a posteriori probability was
retained. This analysis identified a motif with the consensus
GAAGAA, which is identical to ESEs found in human exons.
The identification of several putative SR proteins, and sequences
identical to the ESEs in humans, suggests that some features of exon
recognition and splicing observed in higher eukaryotes may be
conserved in P. falciparum. □
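The data preparation for the motif search can be illustrated as follows. This sketch extracts the windows at both ends of each internal exon and then counts, for every 6-mer, the number of windows in which it occurs. Note that simple k-mer counting is used here as a much cruder stand-in for the Gibbs sampler applied in the study; the function names and toy exons are invented for illustration.

```python
from collections import Counter


def exon_end_windows(exons, width=50):
    """Split internal exons into two data sets of terminal windows.

    Returns (starts, ends): the first `width` bases of each exon (next to
    the upstream 3' splice site) and the last `width` bases (next to the
    downstream 5' splice site).
    """
    starts, ends = [], []
    for exon in exons:
        exon = exon.upper()
        starts.append(exon[:width])
        ends.append(exon[-width:])
    return starts, ends


def kmer_window_counts(windows, k=6):
    """Count, for each k-mer, how many windows contain it at least once."""
    counts = Counter()
    for window in windows:
        counts.update({window[i:i + k] for i in range(len(window) - k + 1)})
    return counts


# Toy exons (invented), each carrying one copy of GAAGAA.
toy_exons = ["ATGAAGAATT", "CCGAAGAACC", "TTTTGAAGAA"]
starts, ends = exon_end_windows(toy_exons, width=10)
print(kmer_window_counts(starts).most_common(1))
```

A real analysis would also need a background model, since in an (A + T)-rich genome many A-rich k-mers are frequent by chance; this is one reason a probabilistic method such as Gibbs sampling is preferable.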

Methods 

Sequencing  and  closure 

P. falciparum clone 3D7 was selected for sequencing because it can complete all phases of
the life cycle, and had been used in a genetic cross and the Wellcome Trust Malaria
Genome Mapping Project. High-molecular-mass genomic DNA was subjected to
electrophoresis on preparative pulsed field gels, and chromosomes were excised. DNA was
extracted from the gel, sheared, and cloned into the pUC18 vector as described
(chromosomes 2, 14) or into a modified pUC18 vector via BstXI linkers (chromosomes 10,
11). Sequences were assembled and gaps were closed by primer walking on plasmid DNAs
or genomic PCR products, or by transposon insertion. Ordering of contigs was facilitated
by the use of sequence tagged sites and microsatellite markers. The final assembly of
each chromosome was verified by comparison with BamHI and NheI optical restriction
maps. The average difference in size between the experimentally determined restriction
fragments and the fragments predicted from the sequence was approximately 5-6% for
chromosomes 11 and 14 for both enzymes. For chromosome 10, the average difference in
fragment sizes was 6.1% for the NheI map, but the BamHI optical and predicted restriction
maps could not be aligned. Because the NheI optical restriction map agreed with that
predicted from the sequence, the chromosome 10 assembly was judged to be correct.
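The comparison between optical and sequence-predicted restriction maps rests on an in silico digest of the assembled sequence. The sketch below is a simplified illustration: it predicts fragment sizes for BamHI (G^GATCC) and NheI (G^CTAGC) on a linear sequence and computes a mean percentage size difference, assuming the observed and predicted fragments have already been paired in order. Aligning a real optical map to a predicted map is a harder problem that is not attempted here, and the function names are hypothetical.

```python
import re

# Recognition sequence and cut offset (bases from the start of the site).
ENZYMES = {
    "BamHI": ("GGATCC", 1),
    "NheI": ("GCTAGC", 1),
}


def digest(seq: str, enzyme: str) -> list:
    """Fragment lengths from a complete digest of a linear sequence."""
    site, offset = ENZYMES[enzyme]
    # Lookahead so that overlapping sites (possible for GCTAGC) are found.
    cuts = [m.start() + offset
            for m in re.finditer("(?=" + site + ")", seq.upper())]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def mean_percent_difference(observed, predicted) -> float:
    """Mean absolute size difference, as a percentage of predicted size."""
    diffs = [abs(o - p) / p for o, p in zip(observed, predicted)]
    return 100 * sum(diffs) / len(diffs)


toy = "AAAA" + "GGATCC" + "TTTTT" + "GGATCC" + "CC"
print(digest(toy, "BamHI"))
print(mean_percent_difference([105, 95], [100, 100]))
```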

Annotation 

GlimmerM and phat were trained on 117 P. falciparum genes and 39 cDNAs taken from
GenBank, plus 32 genes from chromosomes 2 and 3 that had been verified by RT-PCR
(provided by R. Huestis and K. Fischer; the training set is available at
http://www.tigr.org/software/glimmerm/data). The GlimmerM and phat predictions, and sequence
alignments of the chromosomes to protein and cDNA databases, were evaluated by the
Combiner program. The program used a linear weighting method and dynamic
programming to construct consensus gene models that were curated manually using
AnnotationStation (Affymetrix Inc.). Predicted proteins were searched against a
non-redundant amino-acid database using BLASTP; other features were identified by searches
against the Pfam, PROSITE and InterPro databases. The results of all analyses were
reviewed using Manatee, a tool that interfaces with a relational database of the information
produced by the annotation software. Predicted gene products were manually assigned
Gene Ontology terms. Signal peptides and signal anchors were predicted with
SignalP-2.0 (ref. 23). Transmembrane helices were predicted with TMHMM.
Mitochondrial- and apicoplast-targeted proteins were predicted by MitoProtII,
TargetP and PATS. tRNAscan-SE was used to identify transfer RNAs.

Species of malaria parasite that infect rodents have long been used as models for malaria disease research. Here we report the
whole-genome shotgun sequence of one species, Plasmodium yoelii yoelii, and comparative studies with the genome of the human
malaria parasite Plasmodium falciparum clone 3D7. A synteny map of 2,212 P. y. yoelii contiguous DNA sequences (contigs)
aligned to 14 P. falciparum chromosomes reveals marked conservation of gene synteny within the body of each chromosome. Of
about 5,300 P. falciparum genes, more than 3,300 P. y. yoelii orthologues of predominantly metabolic function were identified. Over
800 copies of a variant antigen gene located in subtelomeric regions were found. This is the first genome sequence of a model
eukaryotic parasite, and it provides insight into the use of such systems in the modelling of Plasmodium biology and disease.


For decades, the laboratory mouse has provided an alternative
platform for infectious disease research where the pathogen under
study is intractable to routine laboratory manipulation. Experimental
study of the human malaria parasite Plasmodium falciparum is
particularly problematic as the complete life cycle cannot be maintained
in vitro. Four species of rodent malaria (Plasmodium yoelii,
Plasmodium berghei, Plasmodium chabaudi and Plasmodium
vinckei) isolated from wild thicket rats in Africa have been adapted
to grow in laboratory rodents. These species reproduce many of the
biological characteristics of the human malaria parasite. Many of
the experimental procedures refined for use with P. falciparum were
initially developed for rodent malaria species, a prime example
being stable genetic transformation. Thus rodent models of malaria
have been used widely and successfully to complement research on
P. falciparum.

With the advent of the P. falciparum Genome Sequencing Project,
undertaken by an international consortium of genome sequencing
centres and malaria researchers, a series of initiatives has begun to
generate substantial genome information from additional Plasmodium
species. We describe here the genome sequence of the rodent
malaria parasite P. y. yoelii to fivefold genome coverage. We show
that this partial genome sequencing approach, although limited in
its application to the study of genome structure, has proved to be an
effective means of gene discovery and of jump-starting experimental
studies in a model Plasmodium species. Furthermore, we show
that despite the considerable divergence between the P. y. yoelii and
P. falciparum genomes, sequencing and annotation of the former
can substantially improve the accuracy and efficiency of annotation
of the latter.

Plasmodium  yoelii  yoelii  genome  sequencing  and  annotation 

We applied the whole-genome shotgun (WGS) sequencing
approach, used successfully to sequence and assemble the first
large eukaryotic genome, to achieve fivefold sequence coverage of
the genome of a clone of the 17XNL line of P. y. yoelii (Table 1). This
level of coverage is expected to comprise 99% of the genome,
assuming random library representation. As with P. falciparum,
the genomes of rodent malaria parasites are highly (A + T)-rich,
which adversely affects DNA stability in plasmid libraries. Consequently,
all ~220,000 reads were produced from clones originating
from small (2-3 kilobases (kb)) insert libraries. Contigs were
assembled using TIGR Assembler. Contaminating mouse
sequences, identified through similarity searches and found to
comprise 10% of the total sequence data, were excluded from the
analyses. Approximately three-quarters of the contigs could be
placed into 2,906 'groups', each group consisting of two or more
contigs known to be linked through paired reads as determined by
Grouper software. This produced an average group size of 7.4 kb,
approximately 4 kb more than the average contig size. This group
size is small compared with the group data produced by other
partial eukaryotic genome projects, where extensive use of large
insert (linking) libraries has enabled the construction of ordered
and orientated 'scaffolds', and emphasizes the use of such linking
libraries in partial genome projects. The genome size of P. y. yoelii is
estimated to be 23 megabases (Mb), in agreement with karyotype
data.


Table 1 Plasmodium yoelii yoelii genome coverage statistics

Data            Component                        Value
Genome          No. of contigs                   6,687
                Mean contig size (kb)            3.6
                Max. contig size (kb)            51.5
                Cumulative contig length (Mb)    23.1
                No. of singletons                11,732
                No. of groups                    2,906
                Max. group size (kb)             69.8
                Cumulative group size (Mb)       21.6
Transcriptome   No. of ESTs                      13,080
                Average length (nucleotides)     497
Proteome        No. of gametocyte peptides       1,413
                No. of sporozoite peptides       677
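The 99% figure quoted for fivefold coverage can be reproduced with a simple back-of-the-envelope model. The sketch below assumes reads land uniformly at random (a Poisson model of the Lander-Waterman type), in which the expected fraction of bases covered at least once is 1 - e^(-c) for coverage c. This is an illustrative assumption, not necessarily the calculation used in the reference the authors cite, and the function name is hypothetical.

```python
import math


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit by at least one read under a
    Poisson model of random shotgun sampling: 1 - e^(-coverage)."""
    return 1.0 - math.exp(-coverage)


# Compare a few coverage depths; fivefold is the depth reported in the text.
for depth in (1, 3, 5, 8):
    fraction = expected_fraction_covered(depth)
    print(f"{depth}x coverage -> {fraction:.2%} of the genome expected in reads")
```

Under this model the uncovered fraction at fivefold coverage is e^(-5), a little under 1%, which matches the statement in the text. Real libraries are not perfectly random, which is why the text qualifies the estimate.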

Expression data from the P. y. yoelii transcriptome and proteome
were generated to aid in gene identification and annotation of the
contigs (Table 1). A total of 13,080 expressed sequence tag (EST)
sequences generated from clones of an asexual blood-stage P. y.
yoelii complementary DNA library, in combination with other P.
yoelii ESTs and transcript sequences available from public databases,
were assembled and used to compile a gene index of expressed P. y.
yoelii sequences (http://www.tigr.org/tdb/tgi/pygi/). For protein
expression data, multidimensional protein identification technology
(MudPIT), which combines high-resolution liquid chromatography
with tandem mass spectrometry and database searching, was
applied to the gametocyte and salivary gland sporozoite proteomes
of P. y. yoelii. A total of 1,413 gametocyte and 677 sporozoite
peptides were recorded and used for the purposes of gene
annotation.

We used two gene-finding programs, GlimmerMExon and
Phat, to predict coding regions in P. y. yoelii. GlimmerMExon is
based on the eukaryotic gene finder GlimmerM, with modifications
developed for analysing the short fragments of DNA that
result from partial shotgun sequencing. Gene models based on
GlimmerMExon and Phat predictions were refined using Combiner.
Annotation of predicted gene models used TIGR's fully automated
Eukaryotic Genome Control suite of programs. Gene finding
and subsequent annotation were limited to 2,960 contigs (each of
which is over 2 kb in size), a subset of sequences that contains more
than 20 Mb of the genome. A total of 5,878 complete genes and
1,952 partial genes (defined as genes lacking either an annotated
start or stop codon) can be predicted from the nuclear genome data.


Table 2 Comparison of genome features of P. falciparum and P. y. yoelii

Feature                                       P. y. yoelii    P. falciparum
Size (Mb)                                     23.1            22.9
No. of chromosomes                            14              14
No. of gaps                                   5,812           93
Coverage*                                     5               14.5
(G + C) content (%)                           22.6            19.4
No. of genes†                                 5,878           5,268
Mean gene length (bp)                         1,298           2,283
Gene density (bp per gene)                    2,566           4,338
Per cent coding                               50.6            52.6
Genes with introns (%)                        54.2            53.9
Genes with ESTs (%)                           48.9            49.1
Gene products detected by proteomics (%)      18.2            51.8
Exons
  Mean no. per gene                           2.0             2.4
  (G + C) content (%)                         24.8            23.7
  Mean length (bp)                            641             949
Introns
  (G + C) content (%)                         21.1            13.5
  Mean length (bp)                            209             179
  Total length (bp)                           1,687,689       1,323,509
Intergenic regions
  (G + C) content (%)                         20.7            13.6
  Mean length (bp)                            859             1,694
RNAs
  No. of tRNA genes‡                          39              43
  No. of 5S rRNA genes                        3               3
  No. of 5.8S, 18S and 28S rRNA units         4               7
Mitochondrial genome
  (G + C) content (%)                         31              31
Apicoplast genome
  (G + C) content (%)                         15              14

*Average number of sequence reads per nucleotide.
†Total number of full-length genes.
‡The smaller number reflects the partial nature of the P. y. yoelii genome data.

Comparative  genome  analysis 

A comparison of several genome features of P. falciparum and P. y.
yoelii is shown in Table 2, demonstrating that many similarities exist
between the genomes. Besides the similarly extreme (G + C)
compositions, both genomes contain a comparable number of
predicted full-length genes, with the higher figure in P. y. yoelii
due to an extremely high copy number of variant antigen genes (see
below). Where differences between the genomes do exist, such as the
(G + C) content of the coding portion of the genomes, incompleteness
of the P. y. yoelii genome data, with the associated problems of
accurate gene finding in both species, is likely to be a confounding
factor. As an indication of this problem, analysis of P. y. yoelii
proteomic data identified 83 regions of the genome apparently
expressed during sporozoite and/or gametocyte stages but not
assigned to a P. y. yoelii gene model (data not shown). Many of
these peptide hits appear sufficiently close to a model as to indicate a
fault with gene boundary prediction rather than a lack of gene
prediction per se. However, as with the gene model prediction in P.
falciparum, the gene models of P. y. yoelii should be considered
preliminary and under revision.

Identifying orthologues of P. falciparum vaccine candidate proteins
and proteins that are either targets of antimalarial drugs or
involved in antimalarial drug resistance mechanisms is a primary
goal of model malaria parasite genomics. Using BLASTP with a
cutoff E value of 10^-15 and no low-complexity filtering, 3,310
bidirectional orthologues (defined as genes related to each other
through vertical evolutionary descent) can be identified in the full
protein complement of P. falciparum (5,268 proteins) and the
protein complement of P. y. yoelii translated from complete gene
models (5,878 proteins). A list of vaccine candidate orthologues and
orthologues of genes involved in antimalarial drug interactions
identified from among the 3,310 orthologues and from additional
BLAST analyses is shown in Table 3. Those genes that are not
identifiable may either be absent from the partial genome data, or
represent genes that have been lost or diverged sufficiently that they
are undetectable through similarity searching.
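One common way to operationalize a bidirectional search of this kind is the reciprocal-best-hit rule: two proteins are paired when each is the other's top-scoring hit and both hits pass the E-value cutoff. The sketch below illustrates that rule on made-up hit tables. The gene identifiers, scores, tuple layout and the default cutoff are all hypothetical, and the text does not state that this exact procedure was used.

```python
def best_hits(hits, max_evalue):
    """Map each query to its highest-bit-score subject among hits
    whose E value does not exceed max_evalue."""
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue  # hit is too weak to count
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-15):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b, max_evalue)
    b_best = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Hypothetical hit tables: (query, subject, E value, bit score).
pf_vs_py = [
    ("pf_gene1", "py_gene1", 1e-80, 300.0),
    ("pf_gene1", "py_gene2", 1e-20, 90.0),
    ("pf_gene2", "py_gene2", 1e-40, 150.0),
    ("pf_gene3", "py_gene1", 1e-5, 40.0),   # fails the cutoff
]
py_vs_pf = [
    ("py_gene1", "pf_gene1", 1e-78, 295.0),
    ("py_gene2", "pf_gene1", 1e-20, 90.0),
    ("py_gene2", "pf_gene2", 1e-40, 150.0),
]

pairs = reciprocal_best_hits(pf_vs_py, py_vs_pf)
print(pairs)
```

Requiring agreement in both directions is what keeps members of large paralogous families from being paired with the wrong partner, at the cost of missing genes whose true orthologue is absent from partial data.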

Many of the candidate vaccine antigens under study in P.
falciparum can be identified in P. y. yoelii, including orthologues
of several asexual blood-stage antigens known to elicit immune
responses in individuals exposed to natural infection (MSP1,
AMA1, RAP1, RAP2). As immunity to P. falciparum blood-stage
infection can be transferred by immune sera, identification of the
targets of potentially protective antibody responses after natural
infection can provide information beneficial to the selection of
candidate antigens for malaria vaccines. We found several orthologues
of known P. falciparum transmission-blocking candidates;
in particular, members of the P48/45 gene family identified
previously were confirmed.

We identified several P. y. yoelii orthologues of P. falciparum
biochemical pathway components under study as targets for drug
design (Table 3), most notably: (1) the 1-deoxy-D-xylulose 5-phosphate
reductoisomerase (DOXPR) gene whose product is
inhibited by fosmidomycin in P. falciparum in vitro cultures and
mice infected with P. vinckei; (2) enoyl-acyl carrier protein (ACP)
reductase (FABI) whose product is inhibited by triclosan in P.
falciparum in vitro cultures and mice infected with P. berghei;
and (3) a gene encoding farnesyl transferase (FTASE), which is
inhibited in cultures of P. falciparum treated with custom-designed
peptidomimetics. The rodent models of malaria have proved
invaluable both for the study of potency of new antimalarial
compounds in vivo, and for the elucidation of mechanisms of
antimalarial drug resistance.


Table 3 P. y. yoelii orthologues of P. falciparum candidate vaccine and drug interaction genes

P. falciparum gene                                        Chromosome  ST location*  Pf locus     Py locus

Candidate vaccine antigens
Ring-infected erythrocytic surface antigen 1, resa1       1           Yes           PFA0110w     Not identified
Merozoite surface protein 4, msp4                         2           No            PFB0310c     PY07543†
Merozoite surface protein 5, msp5                         2           No            PFB0305c     PY07543†
Liver stage antigen 3, lsa3                               2           No            PFB0915w     Not identified
Merozoite surface protein 2, msp2                         2           No            PFB0300c     Not identified
Transmission-blocking target antigen 230, Pfs230          2           No            PFB0405w     PY03856
Circumsporozoite protein, csp                             3           No            MAL3P2.11    PY03168
Rhoptry-associated protein 2, rap2                        5           Yes           PFE0080c     PY03918
Sporozoite surface antigen, starp                         7           Yes           PF07_0006    Not identified
Merozoite surface protein 1, msp1                         9           No            PFI1475w     PY05748
Liver stage antigen 1, lsa1                               10          No            PF10_0356    Not identified
Merozoite surface protein 3, msp3                         10          No            PF10_0345    Not identified
Glutamate-rich protein, glurp                             10          No            PF10_0344    Not identified
Ookinete surface protein 25, Pfs25                        10          No            PF10_0303    PY00523
Ookinete surface protein 28, Pfs28                        10          No            PF10_0302    PY00522
Erythrocyte membrane-associated 332 antigen, Pf332        11          No            PF11_0507    PY06496
Apical membrane antigen 1, ama1                           11          No            PF11_0344    PY01581
Exported protein 1, exp1                                  11          No            PF11_0224    Not identified
Surface sporozoite protein 2, ssp2                        13          No            PF13_0201    PY03052
Sexual-stage-specific surface antigen 48/45, Pfs48/45     13          No            PF13_0247    PY04207
Rhoptry-associated protein 1, rap1                        14          Yes           PF14_0637    PY00622

Candidate drug interaction genes
Dihydrofolate reductase, dhfr                             4           No            PFD0830w     PY04370
Multidrug resistance protein 1, pfmdr1                    5           No            PFE1150w     PY00245
Translationally controlled tumour protein, tctp           5           No            PFE0545c     PY04896
Farnesyl transferase, ftase                               5           No            PFE0970w     PY06214
Enoyl-acyl carrier reductase, fabI                        6           No            MAL6P1.275   PY03846
Dihydroorotate dehydrogenase, dhod                        6           No            MAL6P1.36    PY02580
Chloroquine-resistance transporter, pfcrt                 7           No            MAL7P1.27    PY05061
Dihydropteroate synthase, dhps                            8           No            PF08_0095    PY02226
Lactate dehydrogenase, ldh                                13          No            PF13_0141    PY03885
DOXP reductoisomerase, doxpr                              14          No            PF14_0641    PY05578

[...] orthologues can be found as Table A in the Supplementary Information.
*[...] distance from the centre to the end of the P. falciparum chromosome.
†Homologue of P. falciparum msp4 and msp5 genes found as a single gene msp4/5 in P. y. yoelii and other rodent malaria species.

We applied the Gene Ontology (GO) gene classification system,
which uses a controlled vocabulary to describe genes and their
function, to indicate which classes of gene among the 3,310
orthologues might differ in number between P. falciparum and P.
y. yoelii (Fig. 1). A similar proportion of proteins were identified for
most of the GO classes between the two species, with the caveat that
fewer total numbers of proteins were identified in P. y. yoelii owing
to the partial nature of the genome data for this species. However,
proteins allocated to the physiological processes, cell invasion and
adhesion, and cell communication categories were significantly
reduced in P. y. yoelii. These classes contain members of three
multigene families whose genes are found predominantly in the
subtelomeric regions of P. falciparum chromosomes: PfEMP1, the
protein product of the var gene family known to be involved in
antigenic variation, cyto-adherence and rosetting, and rifins and
stevors, which are clonally variant proteins possibly involved in
antigenic variation and evasion of immune responses (reviewed in
ref. 20). Apparently, P. falciparum has generated species-specific,
subtelomeric genes involved in host cell invasion, adhesion and
antigenic variation, homologues of which are not found in the P. y.
yoelii genome.

Gene  families  of  unique  interest  in  the  P.  y.  yoelii  genome 

The largest family of genes identified in the P. y. yoelii genome is the
yir gene family, homologues of the vir multigene family recently
described in the human malaria parasite Plasmodium vivax and in
other species of rodent malaria. In P. vivax, an estimated 600-1,000
copies of the subtelomerically located vir gene encode
proteins that are immunovariant in natural infections, indicating
a possible functional role in antigenic variation and immune
evasion. Within the P. y. yoelii genome data, 838 yir genes (693
full genes and 145 partial genes) are present (Table 4; see also
Supplementary Figs A and B). Almost 75% of the annotated contigs
identified as containing subtelomeric sequences (see below) contain
yir genes, many arranged in a head-to-tail fashion. Expression data
indicate that yir genes are expressed during sporozoite, gametocyte
and erythrocytic stages of the parasite, similar to the expression
pattern seen with P. falciparum var and rif genes. Preliminary
results using antibodies developed against the conserved regions of
the protein have confirmed protein localization at the surface of the
infected red blood cell (D.A.C. et al., manuscript in preparation).
The number of gene copies in the P. y. yoelii genome, the localization
and stage-specific expression of gene members, as well as the
existence of homologues in other Plasmodium species, make this
gene family a prime target for the study of mechanisms of immune
evasion.


Figure 1 Functional classification comparison between P. falciparum and P. y. yoelii
proteins. We compared the GO terms of proteins assigned to 'biological process' for the
orthologous genes identified between the two species. The process group contains 3,041
P. falciparum annotations (filled bars), and 2,161 reciprocal annotations are shown for
P. y. yoelii (open bars). Ten GO classes with similar numbers of P. falciparum and P. y.
yoelii proteins in each are assigned as 'miscellaneous'; that is, cell cycle, external
stimulus response, stress response, signal transduction, homeostasis, developmental
processes, cell proliferation, membrane fusion, death, cell motility.


Table 4 Paralogous gene families in P. y. yoelii

Gene family   No.  Name                                   HMM ID                Location in Py  Py expression*  TM/SP†  Pf locus
yir/bir/cir   838  Variant antigen family                 TIGR01590             Subtelomeric    Gmt, spz, bs    P/A     None
235 kDa       14   Reticulocyte binding family            TIGR01612             Subtelomeric    Gmt, spz, bs    P/A     PFD0110w, MAL13P1.176, PF13_0198, PFL2520w,
                                                                                                                        PFDOIlOw
pyst-a        168  Hypothetical                           TIGR01599             Subtelomeric    Gmt, spz        A/A     PF14_0604
pyst-b        57   Hypothetical                           TIGR01597             Subtelomeric    Bs              P/A     None
pyst-c        21   Hypothetical                           TIGR01601, TIGR01604  Subtelomeric    Bs              P/P     None
pyst-d        17   Hypothetical                           TIGR01605             Subtelomeric    Gmt             P/P     None
etramp        11   Early transcribed membrane protein     TIGR01495             Subtelomeric    Gmt, spz, bs    P/P     PF13_0012, PF14_0016, PF11_0040, PFB0120w,
                   family                                                                                               PF10_0323, MAL12P1.387, PF11_0039, PFL1095c,
                                                                                                                        PF10_0019, PFI1745c, PFE1590w, PF10_0164,
                                                                                                                        MAL8P1.6, PFA0195w, PFL0065w, PF14_0729
pst-a         12   Hydrolase family                       TIGR01607             Subtelomeric    Gmt, spz        A/A     PFL2530w, PF10_0379, PF14_0738, PF14_0017,
                                                                                                                        PF14_0737, PFI1800w, PFI1775w, PF07_0040,
                                                                                                                        PF07_0005, PFA0120c
rhoph1/clag   2    Rhoptry H1/cyto-adherence-linked       PF03805               Subtelomeric    Gmt, bs         A/P     PFC0110w, PFC0120w, PFI1730w, PFI1710w,
                   asexual gene family                                                                                  PFB0935w

[...] regarding TM and SP prediction algorithms.)

A maximum of 14 members of the Py235 multigene family can be
identified among the P. y. yoelii protein data (Table 4). This family
expresses proteins that localize to rhoptries (organelles that contain
proteins involved in parasite recognition and invasion of host red
blood cells). Py235 genes exhibit a newly discovered form of clonal
antigenic variation, whereby each individual merozoite derived
from a single parent schizont has the propensity to express a
different Py235 protein. Closely related homologues of the
Py235 gene family have been found in other rodent malaria species,
and more distantly related homologues have been found in P.
vivax and P. falciparum. The gene copy number identified in
the current data set is less than has been predicted in other P. y. yoelii
lines (30-50 per genome). This could reflect real differences in copy
number between lines, but more probably suggests an error in the
original estimate or misassembly of extremely closely related
sequences. Almost all of the Py235 genes are found on contigs
identified as subtelomeric in the P. y. yoelii genome (see Supplementary
Fig. C).

Four further paralogous gene families, pyst-a to -d, are specific to
P. y. yoelii (Table 4). The pyst-a family deserves mention, as it is
homologous to a P. chabaudi glutamate-rich protein and to a
single hypothetical gene on P. falciparum chromosome 14,
suggesting expansion of this family in the rodent malaria species
from a common ancestral Plasmodium gene. Two paralogous gene
families containing multiple members are homologous to multigene
families identified in P. falciparum. Gene members of one
family, etramp (early transcribed membrane protein), have previously
been identified in P. falciparum and in P. chabaudi where a
single member has been identified and localized to the parasitophorous
vacuole membrane.

Telomeres  and  chromosomal  exchange  in  subtelomeric  regions 

The telomeric repeat in P. y. yoelii is AACCCTG, which differs from
the P. falciparum telomeric repeat AACCCTA by one nucleotide. A
total of 71 contigs were found to contain telomeric repeat sequences
arranged in tandem, with the largest array consisting of 186 copies.
The P. y. yoelii subtelomeric chromosomal regions show little repeat
structure compared with those of P. falciparum. A survey of tandem
repeats in the entire genome found only a few in the telomeric or
subtelomeric regions, specifically a 15-base-pair (bp) repeat (45 copies)
and a 31-bp repeat (up to 10 copies), both of which were found on multiple
contigs, and a 36-bp repeat that occurred on one contig. No repeat
element that corresponds to Rep20, a highly variable 21-bp unit that
spans up to 22 kb in P. falciparum telomeres, was found.
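Counting tandem copies of a known repeat unit, as described above for the telomeric heptamer, is straightforward to sketch. The example below scans a toy sequence for the longest uninterrupted run of a given unit and confirms that the two telomeric repeats quoted in the text differ at a single position. The toy sequence and function name are hypothetical; only the forward strand is scanned, and a real survey would also need to handle the reverse complement and imperfect copies.

```python
import re


def longest_tandem_array(sequence: str, unit: str) -> int:
    """Copy number of the longest uninterrupted tandem run of `unit`."""
    pattern = f"(?:{re.escape(unit)})+"
    runs = re.finditer(pattern, sequence)
    return max((len(run.group(0)) // len(unit) for run in runs), default=0)


PY_REPEAT = "AACCCTG"  # P. y. yoelii telomeric repeat, from the text
PF_REPEAT = "AACCCTA"  # P. falciparum telomeric repeat, from the text

# Hypothetical contig end: four tandem copies, a 2-base interruption, two more.
toy_contig = "GATTACA" + PY_REPEAT * 4 + "TT" + PY_REPEAT * 2

copies = longest_tandem_array(toy_contig, PY_REPEAT)
mismatches = sum(a != b for a, b in zip(PY_REPEAT, PF_REPEAT))
print(copies, mismatches)
```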

The telomeric and subtelomeric regions of P. y. yoelii contigs
show extensive large-scale similarity, indicating that these regions
undergo chromosomal exchange similar to that reported for P.
falciparum (see ref. 30). The longest subtelomeric contig is approximately
27 kb (see Supplementary Fig. C) and is homologous to
other subtelomeric contigs across its entire length, indicating that
the region of chromosomal exchange extends at least this distance
into the subtelomeres. Recent data have shown that clustering of
telomeres at the nuclear periphery in asexual and sexual stage P.
falciparum parasites may promote sequence exchange between
members of subtelomeric virulence genes on heterologous chromosomes,
resulting in diversification of antigenic and adhesive phenotypes
(see ref. 31 for review). The suggestion of extensive
chromosome exchange in P. y. yoelii indicates that a similar system
for generating antigenic diversity of the yir, Py235 and other gene
families located within subtelomeric regions may exist.

A  genome-wide  synteny  map 

The Plasmodium lineage is estimated to have arisen some 100-150
million years ago, and species of the parasite are known to infect
birds, mammals and reptiles. On the basis of the analysis of small
subunit (SSU) ribosomal RNA sequences, the closest relative to P.
falciparum is Plasmodium reichenowi, a parasite of chimpanzees,
with the rodent malaria species forming a distinct clade. Early
gene mapping studies have shown that regions of gene synteny exist
between species of rodent malaria and between human malaria
species despite extensive chromosome size polymorphisms
between homologous chromosomes. This level of gene synteny
seems to decrease as the phylogenetic distance between Plasmodium
species increases. Before the Plasmodium genome sequencing
projects, the degree to which conservation of synteny extended
across Plasmodium genomes was not fully apparent.

Using the P. falciparum and P. y. yoelii genome data, we have
constructed a genome-wide syntenic map between the species. To
avoid confounding factors inherent in DNA-based analyses of
(A + T)-rich genomes, we first calculated the protein similarity
between all possible protein-coding regions in both data sets using
MUMmer. Sensitivity was ensured through the use of a minimum
word match length of five amino acids chosen to identify seed
maximal unique matches (MUMs). By comparison, the recent
human-mouse synteny analysis used a match length of 11 (ref. 8).
Using this method, which is independent of gene prediction data,
2,212 sequences could be aligned (tiled) to P. falciparum chromosomes,
representing a cumulative length of 16.4 Mb of sequence, or
over 70% of the P. y. yoelii genome (see Supplementary Table C).
The per cent of each P. falciparum chromosome covered with P. y.
yoelii matches varies from 12% (chromosome 4) to 22% (chromosomes
1 and 14), with an average of about 18%. The spatial
arrangement of the tiling paths (see Fig. 1 in ref. 30) confirms
previous suggestions that most of the conserved matches are found
within the body of Plasmodium chromosomes, and confirms the
absence of var, rif and stevor homologues in the P. y. yoelii genome.
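The idea of seeding an alignment with short words that are unique in both sequences can be shown with a deliberately naive sketch. The code below lists the five-residue words that occur exactly once in each of two toy protein strings and are shared between them. It is a simplification for illustration only: it does not extend seeds into maximal matches and does not use the suffix-tree machinery of MUMmer, and the sequences and function name are hypothetical.

```python
from collections import Counter


def unique_shared_kmers(seq_a: str, seq_b: str, k: int = 5):
    """Return sorted (pos_a, pos_b, word) for k-residue words that occur
    exactly once in seq_a, exactly once in seq_b, and are shared by both."""

    def unique_positions(seq):
        words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        counts = Counter(words)
        # Keep only words seen once, remembering where they start.
        return {w: i for i, w in enumerate(words) if counts[w] == 1}

    pos_a = unique_positions(seq_a)
    pos_b = unique_positions(seq_b)
    shared = pos_a.keys() & pos_b.keys()
    return sorted((pos_a[w], pos_b[w], w) for w in shared)


# Hypothetical peptide fragments sharing the stretch "KTLLVA".
toy_a = "MKTLLVAGGHE"
toy_b = "QQKTLLVAPGH"

seeds = unique_shared_kmers(toy_a, toy_b, k=5)
print(seeds)
```

A shorter word length yields more seeds between diverged sequences but also more chance matches, which is the sensitivity trade-off the text alludes to when contrasting a length of five with the length of 11 used for human-mouse comparison.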

Although the tiling paths indicate the degree of conservation of
gene order between P. falciparum and P. y. yoelii, longer stretches of
contiguous P. y. yoelii sequence are necessary to examine this feature
in depth. Accordingly, we carried out linkage of many P. y. yoelii
assemblies adjacent to each other along the tiling paths. First, 1,050
adjacent contigs were linked on the basis of paired reads as
determined by Grouper software. Second, P. y. yoelii ESTs were
aligned to the tiling paths, and those found to overlap sequences
adjacent in the tiling path were used as evidence to link a further 236
P. y. yoelii sequences. Third, amplification of the sequence between
adjacent contigs in the tiling paths linked a further 817 assemblies.
Linkage of P. y. yoelii sequences by these methods resulted in the
formation of 457 syntenic groups from 2,212 original contigs,
ranging in length from a few kilobases to more than 800 kb. Syntenic
groups were assigned to a P. y. yoelii chromosome where possible
through the use of a partial physical map. Thus, long contiguous
sections of the P. y. yoelii genome with accompanying P. y. yoelii
chromosomal location can be assigned to each P. falciparum
chromosome (see Fig. 1 in ref. 30). The degree of conservation of
gene order between the species was examined using ordered and
orientated syntenic groups and Position Effect software. Of 4,300 P.
y. yoelii genes within the syntenic groups, 3,145 (73%) were found to
match a region of P. falciparum in conserved order.
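Merging contigs into groups from several kinds of pairwise evidence, as in the three linking steps above, can be modelled as finding connected components. The sketch below uses a small union-find structure for this. The contig names, the link list and the choice of union-find are assumptions made for illustration; the text does not describe how Grouper or the downstream linking was implemented.

```python
def link_contigs(contigs, links):
    """Group contigs into connected components given pairwise links."""
    parent = {name: name for name in contigs}

    def find(name):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two groups

    groups = {}
    for name in contigs:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Hypothetical contigs and one link of each evidence type named in the text.
toy_contigs = ["c1", "c2", "c3", "c4", "c5", "c6"]
toy_links = [
    ("c1", "c2"),  # paired reads
    ("c2", "c3"),  # EST overlapping adjacent contigs
    ("c5", "c6"),  # amplification across the gap
]

syntenic_groups = link_contigs(toy_contigs, toy_links)
print(syntenic_groups)
```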

One section of the syntenic map between P. falciparum and P. y.
yoelii in particular (associated with P. falciparum chromosomes 4
and 10 and P. y. yoelii chromosome 5) provides a detailed snapshot
of synteny between the species. Chromosome 5 of P. y. yoelii has
received particular attention owing to the localization of a number
of sexual-stage-specific genes to it, and because truncated versions
of the chromosome are found in lines of the rodent malaria parasite
P. berghei, which is defective in gametocytogenesis. Genomic
resources available for P. berghei chromosome 5 include chromosome
markers and long-range restriction maps. Exploiting the
high level of synteny of rodent malaria parasite chromosomes,
these tools were applied in combination with further mapping
studies to close the syntenic map of chromosome 5 of P. y. yoelii
(Fig. 2).

Approximately 0.8 Mb of P. y. yoelii chromosome 5 (estimated
total length of 1.5 Mb) could be linked into one group that is
syntenic to P. falciparum chromosome 10 and P. falciparum
chromosome 4. From a total of 243 genes predicted in the syntenic
region of P. falciparum chromosome 10, and 34 genes predicted in
the syntenic region of chromosome 4, 171 (70%) and 22 (65%) of
these, respectively, have homologues along P. y. yoelii chromosome 5
that appear in the same order. Pairs of homologous genes that map
to regions of conserved synteny between P. y. yoelii and P. falciparum
are probably orthologues, a view supported by the finding that most of
these homologous pairs are also reciprocal best matches between the
P. falciparum and P. y. yoelii proteins. Genes in the synteny gap on
chromosome 10 (Fig. 2) include a glutamate-rich protein, S antigen,
MSP3, MSP6 and liver stage antigen 1, several of which are prime
vaccine antigen candidates in P. falciparum. Genes in the synteny
gap on chromosome 4 include four var and two rif genes, which
make up one of the four internal clusters of var/rif genes found in
P. falciparum (see ref. 30). A series of uncharacterized hypothetical
genes occurs on the contigs that overlap these regions in P. y. yoelii.
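
The reciprocal-best-match criterion mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the study: the gene identifiers and scores are hypothetical, and the input is assumed to be a table of pairwise similarity scores for searches run in both directions.

```python
def best_hits(scores):
    """Map each query to its single highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical identifiers and scores, for illustration only.
pf_vs_py = {
    ("pf_gene1", "py_gene1"): 250.0,
    ("pf_gene1", "py_gene2"): 90.0,
    ("pf_gene2", "py_gene2"): 310.0,
    ("pf_gene3", "py_gene2"): 120.0,
}
py_vs_pf = {
    ("py_gene1", "pf_gene1"): 245.0,
    ("py_gene2", "pf_gene2"): 305.0,
    ("py_gene2", "pf_gene3"): 118.0,
}

pairs = reciprocal_best_hits(pf_vs_py, py_vs_pf)
```

In this toy input, pf_gene3 finds py_gene2 as its best match, but py_gene2 prefers pf_gene2, so only two pairs qualify as reciprocal best matches.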

An intriguing finding from the study of chromosome 5 has been
the analysis of the syntenic break point between P. falciparum
chromosomes 4 and 10. The final P. y. yoelii contig in the tiling
path with significant synteny to P. falciparum chromosome 10 also
contains the external transcribed sequence (ETS) of the SSU rRNA
C unit. The synteny resumes on P. falciparum chromosome 4 in a
P. y. yoelii contig that also contains the ETS of the large subunit
(LSU) of the same rRNA unit. (No rRNA unit sequences are located
on P. falciparum chromosomes 4 and 10; matches to contigs
containing these genes occur in coding regions of other genes.)
Both P. y. yoelii contigs are linked to each other through a third
contig that contains the remaining elements (SSU, 5.8S, LSU, and
internal transcribed sequences 1 and 2) of the complete rRNA unit
(Fig. 2). Thus it seems that the break in synteny between Plasmodium
chromosomes has occurred within a single rRNA unit, a
phenomenon first reported in prokaryotes. Six rRNA units reside
as individual operons on P. falciparum chromosomes 1, 5, 7, 8, 11
and 13 respectively (ref. 30), in contrast to rodent malaria species
that have four. Intriguingly, breaks in the synteny between P. y.
yoelii and P. falciparum can be mapped to almost all rRNA unit loci
on the P. falciparum chromosomes (see Fig. 1 of ref. 30). A full
analysis of this potential phenomenon is outside the scope of this
study, but these results provide preliminary evidence for one
possible mechanism underlying synteny breakage that may have
occurred during evolution of the Plasmodium genus: chromosome
breakage and recombination at sites of rRNA units.

Figure 2 Conservation of gene synteny between P. y. yoelii chromosome 5 and
P. falciparum. Each coloured line represents a pair of orthologous genes
present in the two species.

Figure 3 Global alignment scheme of a syntenic region between P. falciparum and
P. y. yoelii encompassing ten orthologous gene pairs and nine intergenic regions.
White boxes represent genes that have no orthologue and were excluded from
analysis; green boxes represent gene models that were refined; red boxes
represent unaltered gene models; arrowheads represent gene orientation on the
DNA molecule. Clusters of MUMmer matches between the two species are
represented as thick blue lines. For the ten orthologous gene pairs, synonymous
mutations per synonymous site (dS) and non-synonymous mutations per
non-synonymous site (dN, filled bars) were estimated.

Comparative  alignment  of  syntenic  regions 

Recent comparative studies have revealed that the fine detail of short
stretches of the rodent and human malaria parasite genomes is
remarkably conserved and that such comparisons are useful for
gene prediction and evolutionary studies. Accordingly, we used a
comparison of the longest assembly of P. y. yoelii (MALPY00395,
51.3 kb) and its syntenic region in P. falciparum (chromosome 7, at
coordinates 1,131-1,183 kb) as a case study for a preliminary
evolutionary analysis of the two genomes. Gene prediction programs
run against these two regions identified 11 genes in the
syntenic region of both species (Fig. 3), eight of which are
orthologous gene pairs (genes 1, 3-8 and 10). The structures of two
additional gene pairs (genes 2a/b and 9) were refined through
manual curation of erroneous gene boundaries. Three hypothetical
genes, two in P. falciparum and one in P. y. yoelii, had no discernible
orthologue in the other species; the presence of multiple stop
codons in these areas suggests that the genes may have become
pseudogenes. A global alignment at the DNA level of the syntenic
region (Fig. 3) reveals the similarity between species in intergenic
regions to be almost negligible, as mirrored in similar syntenic
comparisons of mouse and human. Moreover, the mutation
saturation observed in intergenic regions suggests that phylogenetic
footprinting can be used to identify conserved motifs between
species that may be involved in gene regulation.

In contrast to intergenic regions, the similarity between species in
coding regions is relatively high. The average number of
non-synonymous substitutions per non-synonymous site, dN, between
the two species is 26% (± 12%). Synonymous sites, dS, are saturated
(average dS > 1), which supports the lack of similarity observed
within intergenic regions. These values are considerably higher than
those reported for human-rodent comparisons, which are approximately
7.5% and 45% for non-synonymous and synonymous
substitutions, respectively. The cause of such apparent disparities
remains unknown, but may be a consequence of extreme genome
composition or the short generation time of the parasite.
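
To make the notion of saturation concrete, the sketch below applies the Jukes-Cantor correction, which the Nei and Gojobori method (named in the Methods) uses to convert an observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The function names are illustrative only; the numerical inputs are the dN and dS values quoted above.

```python
import math


def jukes_cantor(p):
    """Estimated substitutions per site from an observed proportion of differences."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75); the estimate diverges at 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def proportion_for_distance(d):
    """Inverse of the correction: observed proportion that yields distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# Observed proportions that correspond to the distances quoted in the text.
p_for_dn = proportion_for_distance(0.26)  # dN of 26%
p_for_ds = proportion_for_distance(1.0)   # dS of 1 substitution per site
```

Under this correction a distance of 1 corresponds to more than half of the sites differing, and the estimate grows without bound as p approaches 0.75, which is why distances above about 1 are regarded as saturated and carry little quantitative information.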

Rodent  malaria  species  as  models  for  P.  falciparum  biology 

The usefulness of rodent malaria species as models for the study of P.
falciparum is controversial. It is apparent that rodent models are the
first port of call when preliminary in vivo evidence of antimalarial
drug efficacy, immune response to vaccine candidates, and life-cycle
adaptations in the face of drug or vaccine challenge are required.
Different species of malaria parasite have developed different
mechanisms of resistance to the antimalarial drug chloroquine,
despite a similar mode of action of the drug (reviewed in ref. 49). It
seems that mechanisms developed by the parasite to evade an
inhospitable environment, whether caused by antimalarial drugs
or the host immune system, may differ widely from species to
species. A model involving evolution of different genes in Plasmodium
species as a response to different host environments is
consistent with the comparison of the P. falciparum and P. y. yoelii
genomes presented here; conservation of synteny between the two
species is high in regions of housekeeping genes, but not in regions
where genes involved in antigenic variation and evasion of the host
immune system are located. On the one hand, this can be interpreted
as a blow to the systematic identification of all orthologues of
antigen genes between P. falciparum and P. y. yoelii that could be
used in the design of a malaria vaccine. On the other hand, a picture
is emerging of selecting a model malaria species based on the
complement of genes that best fit the phenotypic trait under
study. Thus the presence of homologues of the yir family may
make P. y. yoelii an attractive model for studying antigenic variation
in P. vivax. Furthermore, identification of orthologues in the
genomes of relatively distant rodent and human malaria parasites
will facilitate finding orthologues in other model malaria species,
for example monkey models of malaria such as Plasmodium
knowlesi.

Methods 

Genome  and  EST  sequencing 

Plasmodium yoelii yoelii 17XNL line, selected from an isolate taken from the blood of a
wild-caught thicket rat in the Central African Republic, is a non-lethal strain with a
preference for development in reticulocytes. Clone 1.1 was obtained through serial
dilution of sporozoites. Parasites were grown in laboratory mice no more than three blood
passages from mosquito passage to limit chromosome instability, collected by
exsanguination into heparin, and host mouse leukocytes were removed by filtration. Small
insert libraries (average insert size 1.6 kb) were constructed in pUC-derived vectors after
nebulization of genomic DNA. DNA sequencing of plasmid ends used ABI Big Dye
terminator chemistry on ABI3700 sequencing machines. A total of 222,716 sequences
(82% success rate), averaging 662 nucleotides in length, were assembled using TIGR
Assembler. BLASTN of the P. y. yoelii contigs and singletons against the complete set of
Celera mouse contigs, using a cutoff of 90% identity over 100 nucleotides, identified
contaminating mouse sequences that were subsequently removed. Contigs were assigned
to groups using Grouper. Each contig was assigned an identifier in the format
'MALPY00001'.
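
The contamination screen described above can be illustrated with a short filter. This is a sketch under assumptions rather than the original procedure: it assumes the BLASTN results are available in the standard 12-column tabular layout (query, subject, per cent identity, alignment length, and so on), it reads the cutoff as at least 90% identity over an alignment of at least 100 nucleotides, and the hit lines and mouse contig names are invented for the example.

```python
def contaminated_queries(tabular_lines, min_identity=90.0, min_length=100):
    """Return the set of query IDs with at least one hit passing both cutoffs.

    Each line is expected to be tab-separated with the query ID in column 1,
    per cent identity in column 3 and alignment length in column 4.
    """
    flagged = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query = fields[0]
        identity = float(fields[2])
        length = int(fields[3])
        if identity >= min_identity and length >= min_length:
            flagged.add(query)
    return flagged


# Hypothetical hits: only the first passes both the identity and length cutoffs.
hits = [
    "MALPY00001\tmouse_contig_A\t98.50\t240\t3\t0\t1\t240\t500\t739\t1e-120\t430",
    "MALPY00002\tmouse_contig_B\t92.00\t60\t5\t0\t10\t69\t100\t159\t1e-15\t80",
    "MALPY00003\tmouse_contig_C\t85.00\t300\t45\t0\t1\t300\t1\t300\t1e-60\t200",
]

flagged = contaminated_queries(hits)
```

Sequences whose identifiers appear in the returned set would then be dropped from the assembly input.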

Proteomic  analysis 

MudPIT technology and methods were as described in ref. 23. Sporozoites of P. y. yoelii
were dissected from infected Anopheles stephensi mosquito salivary glands, and P. y. yoelii
gametocytes were prepared as described. Cellular debris from uninfected mosquitoes
and mouse erythrocytes were analysed as controls. Tandem mass spectrometry (MS/MS)
data sets were searched against several databases: the complete set of P. y. yoelii full and
partial proteins (7,860 total); 791,324 P. y. yoelii open reading frames (stop-to-stop ORFs
over 15 amino acids and start-to-stop ORFs over 100 amino acids); 57,885 ORFs from
NCBI's RefSeq for human, mouse and rat; 15,570 Anopheles, Aedes and Drosophila
melanogaster proteins from GenBank; and 165 common protein contaminants (for
example, trypsin, bovine serum albumin).
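
As an illustration of how a stop-to-stop ORF set of the kind described above might be enumerated, the sketch below scans all six reading frames and keeps every run of codons between stop codons that is longer than 15 codons. It is not the tool used in the study; the function names are hypothetical, the standard stop codons TAA, TAG and TGA are assumed, and sequence ends are treated as ORF boundaries.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_to_stop_orfs(seq, min_codons=16):
    """Return (strand, frame, orf_nucleotides) for every stop-free codon run.

    min_codons=16 corresponds to ORFs of more than 15 amino acids.
    """
    seq = seq.upper()
    results = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            current = []
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if codon in STOP_CODONS:
                    if len(current) >= min_codons:
                        results.append((strand, frame, "".join(current)))
                    current = []
                else:
                    current.append(codon)
            # A run that reaches the end of the sequence is kept as well.
            if len(current) >= min_codons:
                results.append((strand, frame, "".join(current)))
    return results


# Toy sequence: 20 alanine codons flanked by stop codons in forward frame 0.
example = "TAA" + "GCT" * 20 + "TAG"
orfs = stop_to_stop_orfs(example)
```

A start-to-stop variant would additionally require each reported run to begin at an ATG and would use the 100-amino-acid threshold instead.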

Gene  finding  and  annotation 

The splice site recognition module of GlimmerMExon was trained specifically for P. yoelii
genome data, using DNA sequences extracted from a set of 1,166 donor and 1,166 acceptor
sites confirmed by P. y. yoelii ESTs. Phat and the exon recognition module of
GlimmerMExon were trained on P. falciparum data as described (see ref. 54). Combiner
was used to generate a final ranked list of P. y. yoelii gene models, and TIGR's Eukaryotic
Genome Control suite of programs was used for automated annotation of these (both
described in ref. 54). Automated gene names were assigned to proteins by taking the
equivalogue name of the hidden Markov model (HMM) associated with the protein
where possible, or where no HMM was assigned, on the basis of the best-paired alignment.
Each protein was assigned an identifier in the format 'PY00001'.

Paralogous  gene  families 

Proteins encoded by multigene families were identified by a domain-based clustering
algorithm developed at TIGR. Families were regarded as potentially Plasmodium- or
yoelii-specific if they were not described by any Pfam or TIGRFAM domains and if the
automatic annotation process had not ascribed names corresponding to widely
distributed proteins. HMMs for these families were built using the HMMER package
version 2.1.1 (ref. 57). Newly constructed models were then used to search the P. yoelii,
P. falciparum and GenBank databases to define the scope of the families.

Telomeric/subtelomeric repeat analysis

Subtelomeric  contigs  were  identified  through  alignment  using  MUMmer2  (ref.  40)  with  a 
minimum  exact  match  ranging  from  3(M0  bases.  Tandem  Repeat  Finder”  used  the 
following settings: match = 2, mismatch = 7, PM (match probability) = 75, PI (indel
probability) = 10, minscore = 400, max period = 700.

Comparative  analyses 

Gene model predictions in the syntenic region of P. falciparum chromosome 7 were
inspected manually, and bi-directional best hits between gene models that respected
conserved syntenies were selected. A global alignment of the two sequences was calculated
[...]. Nucleotide sequences of predicted gene models were aligned using
CLUSTALW with default parameters, and refined manually. The number of substitutions
per synonymous (dS) and nonsynonymous (dN) site was estimated using the Nei and
Gojobori method. Conservation of gene order was established using Position Effect
(http://www.tigr.org/software), where matches between P. falciparum and P. y. yoelii genes
were  calculated  using  BLASTP  with  a  cutoff  E  value  of  10”'  The  query  and  hit  gene  from 
each match were defined as anchor points in gene sets composed of adjacent genes. Up to
ten genes upstream and downstream from each anchor gene were used in creating the gene
set. An optimal alignment was calculated between the ordered gene sets using BLASTP per
cent similarity scores and a linear gap penalty. Low-scoring alignments with a cumulative
per cent similarity less than 100 were not used. Each optimal alignment provided a list of
matching genes in conserved order between P. falciparum and P. y. yoelii.
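
The gene-order alignment step described above can be sketched as a dynamic-programming alignment of two ordered gene lists. This is an illustration only: the internals of Position Effect are not given in the text, so the sketch assumes a global alignment, treats gene pairs without a BLASTP match as scoring zero, and uses a hypothetical gap penalty of 10; gene names and similarity values are invented.

```python
def align_gene_sets(genes_a, genes_b, similarity, gap_penalty=10.0, min_total=100.0):
    """Align two ordered gene lists; return matched pairs in conserved order.

    similarity: dict mapping (gene_a, gene_b) -> per cent similarity.
    Alignments whose cumulative similarity is below min_total return [].
    """
    n, m = len(genes_a), len(genes_b)
    # score[i][j]: best score for the first i genes of A against the first j of B.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - gap_penalty
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = similarity.get((genes_a[i - 1], genes_b[j - 1]), 0.0)
            score[i][j] = max(
                score[i - 1][j - 1] + sim,       # pair the two genes
                score[i - 1][j] - gap_penalty,   # gap in B
                score[i][j - 1] - gap_penalty,   # gap in A
            )
    # Trace back from the corner, collecting pairs that have a real match.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        sim = similarity.get((genes_a[i - 1], genes_b[j - 1]), 0.0)
        if score[i][j] == score[i - 1][j - 1] + sim:
            if sim > 0:
                pairs.append((genes_a[i - 1], genes_b[j - 1]))
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] - gap_penalty:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    total = sum(similarity[pair] for pair in pairs)
    return pairs if total >= min_total else []


# Hypothetical gene sets around one anchor pair.
pf_genes = ["pfA", "pfB", "pfC", "pfD"]
py_genes = ["pyA", "pyX", "pyC", "pyD"]
sims = {("pfA", "pyA"): 80.0, ("pfC", "pyC"): 65.0, ("pfD", "pyD"): 72.0}

ordered_pairs = align_gene_sets(pf_genes, py_genes, sims)
```

In the toy example the second gene in each list has no counterpart, yet the three flanking pairs are still recovered in order with a cumulative similarity of 217.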
The completion of the Plasmodium falciparum clone 3D7 genome provides a basis on which to conduct comparative proteomics
studies of this human pathogen. Here, we applied a high-throughput proteomics approach to identify new potential drug and
vaccine targets and to better understand the biology of this complex protozoan parasite. We characterized four stages of the
parasite life cycle (sporozoites, merozoites, trophozoites and gametocytes) by multidimensional protein identification technology.
Functional profiling of over 2,400 proteins agreed with the physiology of each stage. Unexpectedly, the antigenically variant
proteins of var and rif genes, defined as molecules on the surface of infected erythrocytes, were also largely expressed in
sporozoites. The detection of chromosomal clusters encoding co-expressed proteins suggested a potential mechanism for
controlling gene expression.


The life cycle of Plasmodium is extraordinarily complex, requiring
specialized protein expression for life in both invertebrate and
vertebrate host environments, for intracellular and extracellular
survival, for invasion of multiple cell types, and for evasion of host
immune responses. Interventional strategies including antimalarial
vaccines and drugs will be most effective if targeted at
specific parasite life stages and/or specific proteins expressed at
these stages. The genomes of P. falciparum and P. yoelii yoelii are
now completed and offer the promise of identifying new and
effective drug and vaccine targets.

Functional genomics has fundamentally changed the traditional
gene-by-gene approach of the pre-genomic era by capitalizing on
the success of genome sequencing efforts. DNA microarrays have
been successfully used to study differential gene expression in the
abundant blood stages of the Plasmodium parasite. However,
transcriptional analysis by DNA microarrays generally requires
microgram quantities of RNA and has been restricted to stages
that can be cultivated in vitro, limiting current large-scale gene
expression analyses to the blood stages of P. falciparum. As several
key stages of the parasite life cycle, in particular the pre-erythrocytic
stages, are not readily accessible to study, and as differential gene
expression is in fact a surrogate for protein expression, global
proteomic analyses offer a unique means of determining not only
protein expression, but also subcellular localization and
post-translational modifications.

We report here a comprehensive view of the protein complements
isolated from sporozoites (the infectious form injected by the
mosquito), merozoites (the invasive stage of the erythrocytes),
trophozoites (the form multiplying in erythrocytes), and gametocytes
(sexual stages) of the human malaria parasite P. falciparum.
These proteomes were analysed by multidimensional protein
identification technology (MudPIT), which combines in-line,
high-resolution liquid chromatography and tandem mass
spectrometry. Two levels of control were implemented to differentiate
parasite from host proteins. By using combined host-parasite
sequence databases and noninfected controls, 2,415 parasite proteins
were confidently identified out of thousands of host proteins;
that is, 46% of all gene products were detected in four stages of the
Plasmodium life cycle (Supplementary Table 1).

Comparative  proteomics  throughout  the  life  cycle 

The sporozoite proteome appeared markedly different from the
other stages (Table 1). Almost half (49%) of the sporozoite proteins
were unique to this stage, which shared an average of 25% of its
proteins with any other stage. On the other hand, trophozoites,
merozoites and gametocytes had between 20% and 33% unique
proteins, and they shared between 39% and 56% of their proteins.
Consequently, only 152 proteins (6%) were common to all four
stages. Those common proteins were mostly housekeeping proteins
such as ribosomal proteins, transcription factors, histones and
cytoskeletal proteins (Supplementary Table 1). Proteins were sorted
into main functional classes based on the Munich Information
Centre for Protein Sequences (MIPS) catalogue with some
adaptations for classes specific to the parasite, such as cell surface and
apical organelle proteins (Fig. 1). When considering the annotated
proteins in the database, some marked differences appeared
between sporozoites and blood stages (Fig. 1). Although great
care was taken to ensure that the results reflect the state of the
parasite in the host, a portion of the data set may reflect the
parasite's response to different purification treatments. However,
the stage-specific detection of known protein markers at each stage
established the relevance of our data set.

Table 1 Comparative summary of the protein lists for each stage

Protein count   Sporozoites   Merozoites   Trophozoites   Gametocytes
152             X             X            X              X
197             -             X            X              X
53              X             -            X              X
28              X             X            -              X
36              X             X            X              -
148             -             -            X              X
73              -             X            -              X
120             X             -            -              X
84              -             X            X              -
80              X             -            X              -
65              X             X            -              -
376             -             -            -              X
286             -             -            X              -
204             -             X            -              -
513             X             -            -              -
2,415           1,049         839          1,036          1,147

Whole-cell protein lysates were obtained from, on average, 17 x 10® sporozoites, 4.5 x 10®
trophozoites, 2.75 x 10® merozoites, and 6.5 x 10® gametocytes.
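
The per-stage totals and the fraction of stage-specific proteins quoted above can be recomputed from the rows of Table 1. The sketch below encodes each row as a count plus a four-character presence pattern; where the scanned table has lost a mark, the pattern is inferred from the fact that the 15 rows must cover the 15 possible stage combinations exactly once. With the rows transcribed this way, the merozoite, trophozoite and gametocyte columns and the grand total match the printed totals, while the sporozoite column sums to 1,047, two fewer than the printed 1,049.

```python
STAGES = ("sporozoite", "merozoite", "trophozoite", "gametocyte")

# (protein count, presence pattern in STAGES order); "X" = detected, "-" = not.
TABLE_1 = [
    (152, "XXXX"), (197, "-XXX"), (53, "X-XX"), (28, "XX-X"), (36, "XXX-"),
    (148, "--XX"), (73, "-X-X"), (120, "X--X"), (84, "-XX-"), (80, "X-X-"),
    (65, "XX--"), (376, "---X"), (286, "--X-"), (204, "-X--"), (513, "X---"),
]


def stage_totals(rows):
    """Sum the protein counts of every row in which a stage is marked present."""
    totals = dict.fromkeys(STAGES, 0)
    for count, pattern in rows:
        for stage, mark in zip(STAGES, pattern):
            if mark == "X":
                totals[stage] += count
    return totals


def unique_fraction(rows, stage):
    """Fraction of a stage's proteins that were detected in that stage only."""
    index = STAGES.index(stage)
    only = "".join("X" if i == index else "-" for i in range(len(STAGES)))
    unique = sum(count for count, pattern in rows if pattern == only)
    return unique / stage_totals(rows)[stage]


totals = stage_totals(TABLE_1)
grand_total = sum(count for count, _ in TABLE_1)
sporozoite_unique = unique_fraction(TABLE_1, "sporozoite")
```

The same helper gives stage-specific fractions of roughly 24%, 28% and 33% for merozoites, trophozoites and gametocytes, in line with the 20% to 33% range stated in the text.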

The  merozoite  proteome 

Merozoites are released from an infected erythrocyte, and after a
short period in the plasma, bind to and invade new erythrocytes.
Proteins on the surface and in the apical organelles of the merozoite
mediate cell recognition and invasion in an active process involving
an actin-myosin motor. Four putative components of the invasion
motor, merozoite cap protein-1 (MCP1), actin, myosin A, and
myosin A tail domain interacting protein (MTIP), were abundant
merozoite proteins (Supplementary Table 2). Abundant merozoite
surface proteins (MSPs) such as MSP1 and MSP2 are linked by a
glycosylphosphatidylinositol (GPI) anchor to the membrane, and both
have been implicated in immune evasion (reviewed in ref. 8). A
second family of peripheral membrane proteins, represented by
MSP3 and MSP6, was also detected (Fig. 2a), although these
proteins are largely soluble proteins of the parasitophorous vacuole,
which are released on schizont rupture. Other vacuolar proteins,
such as the acidic basic repeat antigen (ABRA) and serine repeat
antigen (SERA), were detected in the merozoite fraction, but some
such as S-antigen were not (Supplementary Table 2). Notably,
MSP8 and a related MSP8-like protein were only identified in
sporozoites (Fig. 2a). Some MSPs are diverse in sequence and
may be extensively modified by proteolysis; these features, together
with the association of a variety of peripheral and soluble proteins,
provide for a complex surface architecture.

Many apical organellar proteins, in the micronemes and rhoptries,
have a single transmembrane domain. Among these proteins,
apical membrane antigen 1 (AMA1) and MAEBL were found in
both sporozoite and merozoite preparations (Fig. 2a).
Erythrocyte-binding antigens (EBA), such as EBA 175 and EBA 140/BAEBL,
were found only in the merozoite and trophozoite fractions. Of
note, the reticulocyte-binding protein (PfRH) family (PFD0110w,
MAL13P1.176, PF13_0198, PFL2520w and PFD1150c), which has
similarity with the Py235 family of P. y. yoelii rhoptry proteins and
the Plasmodium vivax reticulocyte-binding proteins, was not
detected in the merozoite fraction. Some PfRH proteins were,
however, detected in sporozoites (Fig. 2a), including RH3, which
is a transcribed pseudogene in blood stages. Components of the
low molecular mass rhoptry complex, the rhoptry-associated
proteins (RAP) 1, 2 and 3, were all found in merozoites. RAP1 was also
detected in sporozoites. The high molecular mass rhoptry protein
complex (RhopH), together with ring-infected erythrocyte surface
antigen (RESA), which is a component of dense granules, is
transferred intact to new erythrocytes at or after invasion and
may contribute to the host cell remodelling process. RhopH1,
RhopH2 (PFI1445w; Ling, I. T. et al., unpublished data) and
RhopH3 were found in the merozoite proteome. RhopH1
(PFC0120w/PFC0110w) has been shown to be a member of the
cyto-adherence linked asexual gene family (CLAG); however, the
presence of CLAG9 in the merozoite fraction (Fig. 2a) suggests that
CLAG9 may also be a RhopH protein, casting some doubt on the
proposed role for this protein in cyto-adherence.

The  trophozoite  proteome 

After erythrocyte invasion the parasite modifies the host cell. The
principal modifications during the initial trophozoite phase (lasting
about 30 h) allow the parasite to transport molecules in and out of
the cell, to prepare the surface of the red blood cell to mediate
cyto-adherence, and to digest the cytoplasmic contents, particularly
haemoglobin, in its food vacuole. In the next phase of schizogony
(the final 18 h or so of the asexual development in the blood cell),
nuclear division is followed by merozoite formation and release.

Knob-associated histidine-rich protein (KAHRP) and erythrocyte
membrane proteins 2 and 3 (EMP2 and -3) bind to the
erythrocyte cytoskeleton (Fig. 2a). Of the proteins of the
parasitophorous vacuole and the tubovesicular membrane structure
extending into the cytoplasm of the red blood cell, three (the
skeleton-binding protein 1, and exported proteins EXP1 and EXP2) were
represented by peptides (Fig. 2a), although a fourth (Sar1
homologue, small GTP-binding protein; PFD0810w) was not. It is likely
that one or more of the hypothetical proteins detected only in the
trophozoite sample are involved in these unusual structures.

Digestion of haemoglobin is a major parasite catabolic process.
Members of the plasmepsin family (aspartic proteinases; PF14_0075
to PF14_0078), falcipain family (cysteine proteinases; PF11_0161,
PF11_0162 and PF11_0165) and falcilysin (a metallopeptidase;
PF13_0322) implicated in this process were all clearly identified
(Supplementary Table 1). Several proteases expressed in the
merozoite and trophozoite fractions, and not involved in haemoglobin
digestion, may be important in parasite release at the end of
schizogony, invasion of the new cell, or merozoite protein
processing. Possible candidates for this mechanism include cysteine
proteinases of the falcipain and SERA families, or subtilisins such
as SUB1 and SUB2, both located in apical organelles (Fig. 2a).

Figure 1 Functional profiles of expressed proteins. Proteins identified in each stage are
plotted as a function of their broad functional classification as defined by the MIPS
catalogue. To avoid redundancy, only one class was assigned per protein. The complete
protein list is given in Supplementary Table 1.

The  gametocyte  proteome 

Stage V gametocytes are dimorphic, with a male:female ratio of 1:4.
They are arrested in the cell cycle until they enter the mosquito
where development is induced within minutes to form the male and
female gametes. Gametocyte structure reflects these ensuing fates;
that is, the female has abundant ribosomes and endoplasmic
reticulum/vesicular network to re-initiate translation, whereas the
male is largely devoid of ribosomes and is terminally
differentiated.

Gametocyte-specific transcription factors, RNA-binding proteins,
and gametocyte-specific proteins involved in the regulation
of messenger RNA processing (particularly splicing factors, RNA
helicases, RNA-binding proteins, ribonucleoproteins (RNPs) and
small nuclear ribonucleoprotein particles (snRNPs)) were highly
represented in the gametocyte proteome (Supplementary Table 1).
Transcription in the terminally differentiated gametocytes is
'suppressed', but the female gametocytes contain mRNAs encoding
gamete/zygote/ookinete surface antigens (for example, P25/28)
that are subject to post-transcriptional control; this control is
released rapidly during gamete development. Ribosomal proteins
were largely represented: 82% of known small subunit (SSU)
proteins and 69% of known large subunit (LSU) proteins were
detected in gametocytes compared to 94% and 82%, respectively,
from all stages examined (Supplementary Table 1). We suggest that
this reflects the accumulation of ribosomes in the female
gametocyte to accommodate the sudden increase in protein synthesis
required during gametogenesis and early zygote development.

Figure 2 (caption fragment) ... stage-specific proteins. a, Cell surface, organelle ...
chromosome encoding their genes. The matrices are colour-coded by sequence coverage
measured in each stage (proteins not detected in a stage are represented by black ...

Other protein groupings highly represented in the gametocyte
were in the cell cycle/DNA processing and energy classes (Fig. 1).
The former is consistent with the biological observation that the
mature gametocyte is arrested in G0 of the cell cycle and will require
a full complement of pre-existing cell cycle regulatory cascades to
respond, within seconds, to the gametogenesis stimuli (that is,
xanthurenic acid and a drop in temperature). Metabolic pathways
of the malaria parasite may be stage-specific, with asexual blood
stage parasites dependent on glycolysis and conversion of pyruvate
to lactate (L-lactate dehydrogenase) for energy. In the gametocyte
and sporozoite preparations, peptides from enzymes involved in the
mitochondrial tricarboxylic acid (TCA) cycle and oxidative
phosphorylation were identified (Table 2). This observation suggests that
gametocytes have fully functional mitochondria as a pre-adaptation
to life in the mosquito, as suggested by morphological and
biochemical studies and their sensitivity to anti-malarials attacking
respiration (primaquine and artemisinin-based products). It will
be interesting to observe whether other mosquito and liver stages,
which show similar drug sensitivities, express the same metabolic
proteome.

Cell surface proteins (Fig. 1) included most of the known surface
antigens (Fig. 2a and Supplementary Table 2). However, Pfs35 and a
sexual stage-specific kinase (PF13_0258) were not detected.
Nevertheless the cultured gametocytes analysed in this study expressed a
specific repertoire of rifin and PfEMP1 proteins (Fig. 2b and
Supplementary Table 2). Together these observations suggest that
the gametocyte, which is very long-lived in the red blood cell (that
is, 9-12 days compared with 2 days for the pathogenic asexual
parasites), expresses a limited repertoire of the highly polymorphic
families of surface antigens so widely represented in the asexual
parasites.

The  sporozoite  proteome 

Sporozoites are injected by the mosquito during ingestion of a
blood meal. Although they are in the blood stream for only
minutes, sporozoites probably require mechanisms to evade the
host humoral immune system in order for at least a fraction of
the thousands of sporozoites injected by the mosquito to survive the
hostile environment in the blood and successfully invade
hepatocytes.

The main class of annotated sporozoite proteins identified was
cell surface and organelle proteins (Fig. 1). Sporozoites are an
invasive stage and possess the apical complex machinery involved
in host cell invasion. As observed in the analysis of the P. y. yoelii
sporozoite transcriptome, actin and myosin were found in the
motile sporozoites (Supplementary Table 2). Many proteins
associated with rhoptry, micronemes and dense granules were detected
(Fig. 2a). Among the proteins found were known markers of the
sporozoite stage, such as the circumsporozoite protein (CSP) and
sporozoite surface protein 2 (SSP2; also known as TRAP), both
present in large quantities at the sporozoite surface (Fig. 2a).
Peptides derived from CTRP (circumsporozoite protein and
thrombospondin-related adhesive protein (TRAP)-related protein), an
ookinete cell surface protein involved in recognition and/or
motility, were detected in the sporozoite fractions (Supplementary
Table 1).

Most surprisingly, peptides derived from multiple var (coding for
PfEMP1) and rif genes were identified in the sporozoite samples.
PfEMP1 and rifins are coded for by large multigene families (var
and rif) and are present on the surface of the infected red blood
cell. No peptides derived from rif genes were identified in the
trophozoite sample, whereas sporozoites expressed 21 different
rifins and 25 PfEMP1 isoforms (Fig. 2b); that is, a total of 14% of
the rif genes and 33% of the var genes encoded by the genome.
Furthermore, very little overlap was observed between stages: only
ten PfEMP1 and two rifin isoforms expressed in sporozoites were
found in other stages. Whereas in the blood stream the asexual stage
parasites undergo asexual multiplication and therefore have an
opportunity to undergo antigenic ‘switching’ of the variant antigen
genes, the non-replicative sporozoites may not have this opportunity.
Expressing such a polymorphic array of var (PfEMP1) and rif
genes could be part of a sporozoite survival mechanism.

Chromosomal  clusters  encoding  co-expressed  proteins 

The distinct proteomes of each stage of the Plasmodium life cycle
suggested that there is a highly coordinated expression of Plasmodium
genes involved in common processes. Co-expression groups
are a widespread phenomenon in eukaryotes, where mRNA array
analyses have been used to establish gene expression profiles.
Analysis of co-regulated gene groups facilitates both searching for
regulatory motifs common to co-regulated genes, and predicting
protein function on the basis of the ‘guilt by association’ model.
Furthermore, mRNA analyses in Saccharomyces cerevisiae^"^ and
Homo sapiens^^’^^ have demonstrated that co-regulated genes do
not map to random locations in the genome but are in fact


Table 2  Examples of enzymes in stage-specific metabolic pathways

Stage columns are Spz, Mrz, Tpz and Gmt. Coverage values* are listed in the order in which they survive in the source; where fewer than four values appear, their column positions are not recoverable.

Locus       Coverage*           Enzyme                                            Reaction catalysed

End of glycolysis
PF10_0363   1.2  -  2.4         Pyruvate kinase                                   P-enol pyruvate to pyruvate
MAL6P1.160  8.6  66.9  18.8     Pyruvate kinase                                   P-enol pyruvate to pyruvate
PF13_0141   46.2  83.9  70.9    L-lactate dehydrogenase                           Pyruvate to lactate

TCA cycle and oxidative phosphorylation
PF10_0218   12.3                Citrate synthase                                  Acetyl CoA + oxaloacetate to citrate
PF13_0242   3.2  -  16.9        Isocitrate dehydrogenase (NADP)                   Isocitrate to 2-oxoglutarate + CO2
PF08_0045   2.9  -  2.2         2-Oxoglutarate dehydrogenase E1 component         2-Oxoglutarate to succinyl CoA
PF10_0334   -  -  3.5           Flavoprotein subunit of succinate dehydrogenase   Succinate to fumarate
PFL0630W    3.7                 Iron-sulphur subunit of succinate dehydrogenase   Succinate to fumarate
PF14_0373   -  -  -             Ubiquinol cytochrome oxidoreductase               Ubiquinol to cytochrome c reductase in electron transport
PFB0795W    -  -  -             ATP synthase F1, α-subunit
PFI1365W    -  -  -             Cytochrome c oxidase subunit
PFI1340W    -  -  -             Fumarate hydratase                                Fumarate to malate
MAL6P1.242  30.4                Malate dehydrogenase                              Malate to oxaloacetate

spz, sporozoite; mrz, merozoite; tpz, trophozoite; gmt, gametocyte.

*The sequence coverage (that is, the percentage of the protein sequence covered by identified peptides) measured in each stage is reported.
†Enzyme Commission (EC) numbers are reported for each protein.
frequently organized into gene clusters on a chromosome. Gene
clustering in Plasmodium species has been demonstrated. Ordered
arrays of genes involved in virulence and antigenic variation (for
example, var, vir and rif genes) are located in the subtelomeric
regions of the chromosomes^^’^®.

To determine whether gene clustering exists along the entire
P. falciparum genome, genes whose protein products were detected
in our analysis were mapped onto all 14 chromosomes in a
stage-dependent manner (Fig. 3a). The 2,415 proteins identified
represented an average of 45% of the open reading frames (ORFs)
predicted per chromosome. The number of protein hits by chromosome
was similar for all stages: sporozoite, merozoite, trophozoite
and gametocyte protein lists constituted 19.7%, 15.8%, 19.5% and
21.6% of the predicted ORFs per chromosome, respectively.
Groups of three or more consecutive loci whose protein products
were detected in a particular stage were defined as chromosomal
clusters encoding co-expressed proteins (Fig. 3b). On the basis of
this definition, a total of 98 clusters containing 3 loci, 32 clusters
containing 4 loci, 5 clusters containing 5 loci, and 3 clusters
containing 6 loci were identified (Supplementary Table 3). For
each chromosome, the frequency of finding clusters encoding
co-expressed proteins containing 3-6 adjacent loci markedly exceeded


Figure 3 Distribution of expressed proteins by chromosome. a, For each stage, genes
whose products were detected (coloured vertical bars) are plotted in the order they appear
on their chromosome (grey boxes). b, Groups of at least three consecutive expressed
genes are defined as chromosomal clusters of co-expressed proteins. Examples of such
clusters, circled in b, are specified in Table 3 and the complete description of the 138
clusters can be found in Supplementary Table 3.


the probability of finding such clusters by chance (see the footnote
of Supplementary Table 3 for details on the probability calculation).
Therefore, chromosomal clusters encoding co-expressed proteins
were prevalent in the P. falciparum genome.
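
The cluster definition used above (three or more consecutive loci whose products were detected in a given stage) can be illustrated with a minimal Python sketch. The function name `find_clusters`, the locus identifiers `g01` to `g10` and the detected set are hypothetical and serve only as an illustration; this is not the authors' code, and it does not reproduce the probability calculation described in Supplementary Table 3. The final lines simply check that the reported cluster counts sum to the 138 clusters quoted in the text.

```python
from itertools import groupby


def find_clusters(ordered_loci, detected, min_size=3):
    """Return runs of at least `min_size` consecutive loci that were all detected.

    ordered_loci: locus identifiers in chromosomal order (hypothetical input).
    detected: set of loci whose protein products were seen in one stage.
    """
    clusters = []
    # groupby splits the ordered list into maximal runs of detected / undetected loci
    for is_hit, run in groupby(ordered_loci, key=lambda locus: locus in detected):
        members = list(run)
        if is_hit and len(members) >= min_size:
            clusters.append(members)
    return clusters


# Hypothetical chromosome with ten loci; identifiers are made up for illustration.
chromosome = ["g01", "g02", "g03", "g04", "g05", "g06", "g07", "g08", "g09", "g10"]
detected = {"g02", "g03", "g04", "g07", "g08", "g10"}
example_clusters = find_clusters(chromosome, detected)

# Cluster counts by size as reported in the text (3, 4, 5 and 6 loci).
reported_counts = {3: 98, 4: 32, 5: 5, 6: 3}
total_clusters = sum(reported_counts.values())
```

In this toy input only the run `g02` to `g04` qualifies; the two-locus run `g07`, `g08` and the isolated `g10` fall below the three-locus threshold.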

Functionally related genes have been shown to cluster in the
S. cerevisiae^* and human genomes^. This phenomenon also occurs
in P. falciparum. A total of 138 clusters encoding co-expressed
proteins were identified and 67 of them (49%) contained at least
two loci that have been functionally annotated. Of these 67 clusters,
30 contained at least two loci whose annotation clearly indicates
that the proteins are functionally related. For example, clusters on
chromosomes 3, 5 and 10 contained ribosomal proteins, proteins
involved in protein modification, and proteins involved in nucleotide
metabolism, respectively (Table 3). Chromosome 14 contained
a cluster of four aspartic proteases co-expressed in all of the blood
stages (Table 3). This cluster was not detected in sporozoites, where
no haemoglobin degradation is expected to occur. Interestingly,
whereas the falcipain gene cluster on chromosome 11 appeared in
our analysis as a cluster of co-expressed proteins (Supplementary
Table 3), the SERA gene cluster on chromosome 2, coding for
proteins that share a papain-like sequence motif’, did not. Of the
ten sporozoite-specific clusters, five involved var and rif genes, such
as the rif cluster located in the subtelomeric domain of chromosome
14 (Table 3). On the basis of their presence in clusters encoding
co-expressed proteins, we were able to suggest functional roles for
24 proteins annotated as hypothetical in the P. falciparum genome
(Supplementary Table 3). For example, a gametocyte-specific cluster
on chromosome 13 encoded two transmission-blocking antigens
(Pfs48/45 and Pfs47) and a hypothetical protein, PF13_0246, which
might be a gametocyte surface protein. Two clusters on chromosomes
2 and 11 were highly specific to the trophozoite stage (Table
3). Each of these clusters contained well-known secreted and surface
proteins, namely KAHRP, PfEMP3, antigen 332, and RESA, all of
which have been implicated in knob formation. The highly coordinated
expression of these genes makes the three hypothetical
proteins listed in these trophozoite-specific gene clusters possible
candidates for involvement in cyto-adherence.

Discussion 

Although sample handling is a principal consideration when studying
pathogens, the expression of large numbers of previously
identified proteins was consistent with their published expression
profiles, validating our data set as a meaningful sampling of each
stage’s proteome. This is a particularly important aspect of our
analysis as 65% of the 5,276 genes encoded by the P. falciparum
genome are annotated as hypothetical', and of the 2,415 expressed
proteins we identified, 51% are hypothetical proteins (Supplementary
Table 1). Our results confirmed that these hypothetical ORFs
predicted by gene modelling algorithms were indeed coding regions.
Furthermore, from all four stages analysed, we identified 439
proteins predicted to have at least one transmembrane segment or
a GPI addition signal (18% of the data set) and 304 soluble proteins
with a signal sequence; that is, potentially secreted or located to
organelles. Well over half of the secreted proteins and integral
membrane proteins detected were annotated as hypothetical
(Supplementary Table 4). The obvious interest in this class of
proteins is that, with no homology to known proteins, they
represent potential Plasmodium-specific proteins and may provide
targets for new drug and vaccine development.

Our  comprehensive  large-scale  analysis  of  protein  expression 
showed  that  most  surface  proteins  are  more  widely  expressed  than 
initially  thought.  In  particular,  the  var  and  rif  genes,  which  were 
thought  to  be  involved  in  immune  evasion  only  in  the  blood  stage, 
have  now  been  shown  to  be  expressed  in  apparently  large  and  varied 
numbers  at  the  sporozoite  stage.  These  surface  proteins  might  be 
involved  in  general  interaction  processes  with  host  cells  and/or 
immune  evasion.  An  alternative  hypothesis  is  that  stage-specific 
Table 3  Examples of chromosomal gene clusters encoding co-expressed proteins

Stage columns are Spz, Mrz, Tpz and Gmt. Coverage values are listed in the order in which they survive in the source; where fewer than four values appear, their column positions are not recoverable.

Chr  Locus       Coverage                Description                                                 Functional class

3    PFC0285C    2.1  12.7  33.2  18.7   T-complex protein β-subunit                                 Protein fate
3    PFC0290W    8.3  -  33.8  18.6      40S ribosomal protein S23                                   Protein synthesis
3    PFC0295C    14.9  52.5  21.3        40S ribosomal protein S12                                   Protein synthesis
3    PFC0300C    -  12.1  30.4  17.9     60S ribosomal protein L7                                    Protein synthesis

5    PFE1345C    1.9  1.6                Minichromosome maintenance protein 3                        Cell transport
5    PFE1350C    -  22.4  -              Ubiquitin-conjugating enzyme                                Protein fate
5    PFE1355C    -  4.8  2.6  2.6        Ubiquitin carboxy-terminal hydrolase                        Protein fate
5    PFE1360C    -  -  7.7  -            Methionine aminopeptidase                                   Protein fate

10   PF10_0121   10.8  74.5  29          Hypoxanthine phosphoribosyltransferase                      Metabolism
10   PF10_0122   5.4  6.1  -  6.1        Phosphoglucomutase                                          Metabolism
10   PF10_0123   11.7  -  -              GMP synthetase                                              Metabolism
10   PF10_0124   0.9  1.8  -             Hypothetical protein

14   PF14_0074   26.6  -  -  4.9         Hypothetical protein
14   PF14_0075   26.5  43.2  47.4        Plasmepsin                                                  Protein fate
14   PF14_0076   -  6.6  35.2  10        Plasmepsin 1                                                Protein fate
14   PF14_0077   -  21.2  43  11.5       Plasmepsin 2                                                Protein fate
14   PF14_0078   14.2  52.8  29.9        HAP protein                                                 Protein fate

14   PF14_0002   3.5  -  -  -            Rifin                                                       Surface or organelles
14   PF14_0003   7.9  -  -  -            Rifin                                                       Surface or organelles
14   PF14_0004   6.5  -  -               Rifin                                                       Surface or organelles

2    PFB0090C    -  3  -                 Hypothetical protein, conserved
2    PFB0095C    -  3.4  -               Erythrocyte membrane protein 3                              Surface or organelles
2    PFB0100C    -  1.5  24.8  -         Knob-associated histidine-rich protein                      Surface or organelles

11   PF11_0506   -  6.3  4.4             Hypothetical protein
11   PF11_0507   -  0.8  -               Antigen 332                                                 Surface or organelles
11   PF11_0508   -  3.3  -               Hypothetical protein
11   PF11_0509   -  6.4  3  -            RESA                                                        Surface or organelles

13   PF13_0246   4.5  -  -  8.6          Hypothetical protein
13   PF13_0247   -  32.4                 Transmission-blocking target antigen precursor (Pfs48/45)   Surface or organelles
13               -  -  7.1               Transmission-blocking target antigen precursor (Pfs47)      Surface or organelles


based on a hidden Markov model (HMM), big-PI Predictor" and SignalP^ algorithms).


regulation  is  not  as  exact  as  previously  thought. 

One mechanism of protein expression control that contributes to
stage specificity in P. falciparum arises from the chromosomal
clustering of genes encoding co-expressed proteins. The clusters
described in this study demonstrate a widespread high order of
chromosomal organization in P. falciparum and probably correspond
to regions of open chromatin allowing for co-regulated gene
expression. The high (A + T) content of the P. falciparum genome
makes the identification of regulatory sequences such as promoters
and enhancers challenging^. Focusing analyses on stage-specific
and multi-stage clusters will facilitate finding stage-specific and
general cis-acting sequences in the Plasmodium genome and will
help decipher gene expression regulation during the parasite life
cycle.

The  malaria  parasite  is  a  complex  multi-stage  organism,  which 
has  co-evolved  in  mosquitoes  and  vertebrates  for  millions  of  years. 
Designing  drugs  or  vaccines  that  substantially  and  persistently 
interrupt  the  life  cycle  of  this  complex  parasite  will  require  a 
comprehensive understanding of its biology. The P. falciparum
genome sequence and comparative proteomics approaches may
initiate new strategies for controlling the devastating disease caused
by this parasite.

Methods 

Parasite  material 

Plasmodium falciparum clone 3D7 (Oxford) was used throughout. Sporozoites were
initially isolated from the salivary glands of Anopheles stephensi mosquitoes, 14 days after
infection, by centrifugation in a Renografin 60 gradient, as described^^. Four sporozoite
samples were used as is. A fifth sample underwent an additional purification step on
Dynabeads M-450 Epoxy coupled to NFS1 (an anti-P. falciparum CS protein monoclonal
antibody)^ according to the manufacturer’s instructions (Dynal). Trophozoite-infected
erythrocytes from synchronized cultures were purified on 70% Percoll-alanine^®, and the
trophozoites were released from the erythrocytes^^. Of the 260 parasitized erythrocytes
counted by Giemsa-stained thin-blood film, 100% were identified as trophozoites.
Merozoites were prepared essentially as described in ref. 36, using highly synchronized
schizonts and purifying the merozoites by passage through membrane filters. Starting with
synchronized asexual parasites grown in suspension culture as described^^*’*, gametocytes
were prepared by daily media changes of static cultures at 37 °C. When there were very few
mature asexual stages present, gametocyte-infected erythrocytes were collected from the
52.5%/45% and 45%/30% interfaces of a Percoll gradient^^. The gametocytes consisted
mostly of stage IV and V parasites with minor contamination (<3%) from mixed asexual
stage parasites. Finally, cellular debris from the upper bodies of parasite-free A. stephensi
and non-infected human erythrocytes were used as controls for sporozoites and
blood-stage parasites, respectively. Every effort was made to minimize enzymatic activity and
protein degradation during sampling, and the subsequent isolation of the parasites;
however, we cannot exclude that some of the differences in protein profiles that we observe
between the different life-cycle stages may be a consequence of the sample-handling
procedures.

Cell lysis

Five sporozoite, four merozoite, four trophozoite and three gametocyte preparations were
lysed, digested and analysed independently. Cell pellets were first diluted ten times in
100 mM Tris-HCl pH 8.5, and incubated on ice for 1 h. After centrifugation at 18,000g for
30 min, supernatants were set aside and microsomal membrane pellets were washed in
0.1 M sodium carbonate, pH 11.6. Soluble and insoluble protein fractions were separated
by centrifugation at 18,000g for 30 min. Supernatants obtained from both centrifugation
steps were either combined (sporozoites, trophozoites and merozoites) or digested and
analysed independently (gametocytes).

Peptide  generation  and  analysis 

The method follows that of Washburn et al.^, with the exception that
Tris(2-carboxyethyl)phosphine hydrochloride (TCEP-HCl; Pierce) was used to reduce
urea-denatured proteins. Peptide mixtures were analysed through MudPIT as described^.

Protein  sequence  databases 

The P. falciparum database contained 5,283 protein sequences. Spectra resulting from
contaminant mosquito and erythrocyte peptides had to be taken into account in the
sporozoite and blood-stage samples, respectively. Tandem mass spectrometry (MS/MS)
data sets from blood stages were therefore searched against a database containing both
P. falciparum protein sequences and 24,006 ORFs from the human, mouse and rat RefSeq
NCBI databases. At the date of the searches, the Anopheles gambiae genome was not
available. The NCBI database contained 922 Anopheles and 313 Aedes proteins, which were
combined with the 14,335 ORFs of the NCBI Drosophila melanogaster^ database to create a
control diptera database. Finally, these databases were complemented with a set of 172
known protein contaminants, such as proteases, bovine serum albumin and human
keratins.
MS/MS data set analysis

The SEQUEST algorithm was used to match MS/MS spectra to peptides in the sequence
databases"'. To account for carboxyamidomethylation, MS/MS data sets were searched
with a relative molecular mass of 57 (Mr 57) added to the average molecular mass of
cysteines. Peptide hits were filtered and sorted with DTASelect"^. Spectra/peptide matches
were only retained if they were at least half-tryptic (Lys or Arg at either end of the identified
peptide) and with minimum cross-correlation scores (XCorr) of 1.8 for +1, 2.5 for +2,
and 3.5 for +3 spectra and DeltaCn (top match’s XCorr minus the second-best match’s
XCorr divided by the top match’s XCorr) of 0.08. Peptide hits were deemed unambiguous
only if they were not found in non-infected controls and were uniquely assigned to
parasite proteins by searching against combined parasite-host databases. Finally, for low
coverage loci, peptide/spectrum matches were visually assessed on two main criteria: any
given MS/MS spectrum had to be clearly above the baseline noise, and both b and y ion
series had to show continuity. The Contrast tool"' was used to compare and merge protein
lists from replicate sample runs and to compare the proteomes established for the four
stages.
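
The retention criteria described in this section (charge-dependent XCorr minima, a DeltaCn minimum of 0.08, and at least half-tryptic peptides) can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the DTASelect code: the function names and the example peptide are hypothetical, thresholds are treated as inclusive, and the half-tryptic test follows the wording above literally (Lys or Arg at either end of the peptide string), which simplifies how search tools usually evaluate cleavage sites.

```python
# Minimum XCorr per precursor charge state, as stated in the text.
XCORR_MIN = {1: 1.8, 2: 2.5, 3: 3.5}
# Minimum DeltaCn, as stated in the text (assumed inclusive).
DELTA_CN_MIN = 0.08


def delta_cn(top_xcorr, second_xcorr):
    """(top XCorr - second-best XCorr) / top XCorr, as defined in the text."""
    if top_xcorr <= 0:
        return 0.0
    return (top_xcorr - second_xcorr) / top_xcorr


def is_half_tryptic(peptide):
    """Assumption: 'Lys or Arg at either end' is read as K/R at the first or last residue."""
    return bool(peptide) and (peptide[0] in "KR" or peptide[-1] in "KR")


def passes_filter(peptide, charge, top_xcorr, second_xcorr):
    """Return True if a spectrum/peptide match meets all three retention criteria."""
    threshold = XCORR_MIN.get(charge)
    if threshold is None:  # charge states other than +1, +2, +3 are not covered
        return False
    return (
        is_half_tryptic(peptide)
        and top_xcorr >= threshold
        and delta_cn(top_xcorr, second_xcorr) >= DELTA_CN_MIN
    )


# Hypothetical match: a +2 spectrum with XCorr 3.0 and second-best XCorr 2.5.
example_kept = passes_filter("LSDGVAVLK", 2, 3.0, 2.5)
```

The further requirements in the text (absence from non-infected controls, unique assignment to parasite proteins, and visual inspection of low-coverage loci) are separate steps and are not modelled here.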